Abstract


We introduce a generic scheme for accelerating first-order optimization methods
in the sense of Nesterov, which builds upon a new analysis of the accelerated proximal
point algorithm. Our approach consists of minimizing a convex objective
by approximately solving a sequence of well-chosen auxiliary problems, leading
to faster convergence. This strategy applies to a large class of algorithms, including
gradient descent, block coordinate descent, SAG, SAGA, SDCA, SVRG,
Finito/MISO, and their proximal variants. For all of these methods, we provide
acceleration and explicit support for non-strongly convex objectives. In addition
to theoretical speed-up, we also show that acceleration is useful in practice, especially
for ill-conditioned problems where we measure significant improvements.

1 Introduction 

A large number of machine learning and signal processing problems are formulated as the
minimization of a composite objective function $F : \mathbb{R}^p \to \mathbb{R}$:

\[ \min_{x \in \mathbb{R}^p} \big\{ F(x) := f(x) + \psi(x) \big\}, \tag{1} \]


where $f$ is convex and has Lipschitz continuous derivatives with constant $L$, and $\psi$ is convex
but may not be differentiable. The variable $x$ represents model parameters and the role of $f$ is to
ensure that the estimated parameters fit some observed data. Specifically, $f$ is often a large sum
of functions

\[ f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x), \tag{2} \]

and each term $f_i(x)$ measures the fit between $x$ and a data point indexed by $i$. The function
$\psi$ in (1) acts as a regularizer; it is typically chosen to be the squared $\ell_2$-norm, which is
smooth, or to be a non-differentiable penalty such as the $\ell_1$-norm or another sparsity-inducing
norm [2]. Composite minimization also encompasses constrained minimization if we consider
extended-valued indicator functions $\psi$ that may take the value $+\infty$ outside of a convex set
$\mathcal{C}$ and $0$ inside (see [11]).

Our goal is to accelerate gradient-based or first-order methods that are designed to solve (1), with
a particular focus on large sums of functions (2). By “accelerating”, we mean generalizing a
mechanism invented by Nesterov [17] that improves the convergence rate of the gradient descent
algorithm. More precisely, when $\psi = 0$, gradient descent steps produce iterates $(x_k)_{k \geq 0}$
such that $F(x_k) - F^* = O(1/k)$, where $F^*$ denotes the minimum value of $F$. Furthermore, when the
objective $F$ is strongly convex with constant $\mu$, the rate of convergence becomes linear in
$O((1 - \mu/L)^k)$. These rates were shown by Nesterov [16] to be suboptimal for the class of
first-order methods, and instead optimal rates, $O(1/k^2)$ for the convex case and
$O((1 - \sqrt{\mu/L})^k)$ for the $\mu$-strongly convex one, could be obtained by taking gradient
steps at well-chosen points. Later, this acceleration technique was extended to deal with
non-differentiable regularization functions $\psi$ [4, 19].
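To give a feel for the gap between the two linear rates quoted above, the following minimal sketch counts how many iterations each rate needs before the factor $(1 - \text{rate})^k$ drops below a target $\varepsilon$. The values of `mu`, `L` and `eps` are illustrative assumptions, not taken from the paper, and constants hidden by the $O(\cdot)$ notation are ignored.

```python
import math

def iterations_to_accuracy(rate, eps):
    """Smallest integer k such that (1 - rate)**k <= eps, for 0 < rate < 1."""
    return math.ceil(math.log(eps) / math.log(1.0 - rate))

# Illustrative (assumed) constants: condition number L/mu = 1e4.
mu, L, eps = 1.0, 1.0e4, 1.0e-6

k_gd = iterations_to_accuracy(mu / L, eps)              # rate mu/L
k_acc = iterations_to_accuracy(math.sqrt(mu / L), eps)  # rate sqrt(mu/L)
print("gradient descent:", k_gd, "accelerated:", k_acc)
```

The ratio between the two counts grows like $\sqrt{L/\mu}$, which is why acceleration matters most for ill-conditioned problems.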

For modern machine learning problems involving a large sum of $n$ functions, a recent effort has been
devoted to developing fast incremental algorithms [6, 7, 14, 24, 25, 27] that can exploit the particular
structure of (2). Unlike full gradient approaches, which require computing and averaging $n$ gradients
$\nabla f(x) = (1/n) \sum_{i=1}^{n} \nabla f_i(x)$ at every iteration, incremental techniques have a
per-iteration cost that is independent of $n$. The price to pay is the need to store a moderate amount
of information regarding past iterates, but the benefit is significant in terms of computational
complexity.

Main contributions. Our main achievement is a generic acceleration scheme that applies to a
large class of optimization methods. By analogy with substances that increase chemical reaction
rates, we call our approach a “catalyst”. A method may be accelerated if it has a linear convergence
rate for strongly convex problems. This is the case for full gradient [4, 19] and block coordinate
descent methods [18, 21], which already have well-known accelerated variants. More importantly,
it also applies to incremental algorithms such as SAG [24], SAGA [6], Finito/MISO [7, 14],
SDCA [25], and SVRG [27]. Whether or not these methods could be accelerated was an important
open question. It was only known to be the case for dual coordinate ascent approaches such as
SDCA [26] or SDPC [28] for strongly convex objectives. Our work provides a universal positive
answer regardless of the strong convexity of the objective, which brings us to our second achievement.

Some approaches such as Finito/MISO, SDCA, or SVRG are only defined for strongly convex
objectives. A classical trick to apply them to general convex functions is to add a small regularization
$\varepsilon \|x\|^2$ [25]. The drawback of this strategy is that it requires choosing in advance the
parameter $\varepsilon$, which is related to the target accuracy. A consequence of our work is to
automatically provide direct support for non-strongly convex objectives, thus removing the need to
select $\varepsilon$ beforehand.

Other contribution: Proximal MISO. The approach Finito/MISO, which was proposed in [7]
and [14], is an incremental technique for solving smooth unconstrained $\mu$-strongly convex problems
when $n$ is larger than a constant $\beta L/\mu$ (with $\beta = 2$ in [14]). In addition to providing
acceleration and support for non-strongly convex objectives, we also make the following specific
contributions:

• we extend the method and its convergence proof to deal with the composite problem (1);

• we fix the method to remove the “big data condition” $n \geq \beta L/\mu$.

The resulting algorithm can be interpreted as a variant of proximal SDCA [25] with a different step
size and a more practical optimality certificate: checking the optimality condition does not
require evaluating a dual objective. Our construction is indeed purely primal. Neither our proof of
convergence nor the algorithm uses duality, while SDCA is originally a dual ascent technique.

Related work. The catalyst acceleration can be interpreted as a variant of the proximal point
algorithm [3, 9], which is a central concept in convex optimization, underlying augmented Lagrangian
approaches and composite minimization schemes [5, 20]. The proximal point algorithm consists
of solving (1) by minimizing a sequence of auxiliary problems involving a quadratic regularization
term. In general, these auxiliary problems cannot be solved with perfect accuracy, and several
notions of inexactness were proposed, including [9, 10, 22]. The catalyst approach hinges upon
(i) an acceleration technique for the proximal point algorithm originally introduced in the pioneer
work [9]; (ii) a more practical inexactness criterion than those proposed in the past.¹ As a result, we
are able to control the rate of convergence for approximately solving the auxiliary problems with
an optimization method $\mathcal{M}$. In turn, we are also able to obtain the computational complexity
of the global procedure for solving (1), which was not possible with previous analysis [9, 10, 22]. When
instantiated in different first-order optimization settings, our analysis yields systematic acceleration.

Beyond [9], several works have inspired this paper. In particular, accelerated SDCA [26] is an
instance of an inexact accelerated proximal point algorithm, even though this was not explicitly
stated in [26]. Their proof of convergence relies on different tools than ours. Specifically, we use the
concept of estimate sequence from Nesterov [17], whereas the direct proof of [26], in the context
of SDCA, does not extend to non-strongly convex objectives. Nevertheless, part of their analysis
proves to be helpful to obtain our main results. Another useful methodological contribution was the
convergence analysis of inexact proximal gradient methods of [23]. Finally, similar ideas appear in
the independent work [8]. Their results overlap in part with ours, but both papers adopt different
directions. Our analysis is for instance more general and provides support for non-strongly convex
objectives. Another independent work with related results is [13], which introduces an accelerated
method for the minimization of finite sums that is not based on the proximal point algorithm.


¹Note that our inexact criterion was also studied, among others, in [22], but the analysis of [22] led to the
conjecture that this criterion was too weak to warrant acceleration. Our analysis refutes this conjecture.

2 The Catalyst Acceleration


We present here our generic acceleration scheme, which can operate on any first-order or
gradient-based optimization algorithm with linear convergence rate for strongly convex objectives.

Linear convergence and acceleration. Consider the problem (1) with a $\mu$-strongly convex
function $F$, where the strong convexity is defined with respect to the $\ell_2$-norm. A minimization
algorithm $\mathcal{M}$, generating the sequence of iterates $(x_k)_{k \geq 0}$, has a linear convergence
rate if there exists $\tau_{\mathcal{M},F}$ in $(0,1)$ and a constant $C_{\mathcal{M},F}$ in $\mathbb{R}$
such that

\[ F(x_k) - F^* \leq C_{\mathcal{M},F} (1 - \tau_{\mathcal{M},F})^k, \tag{3} \]

where $F^*$ denotes the minimum value of $F$. The quantity $\tau_{\mathcal{M},F}$ controls the
convergence rate: the larger $\tau_{\mathcal{M},F}$ is, the faster the convergence to $F^*$. However, for
a given algorithm $\mathcal{M}$, the quantity $\tau_{\mathcal{M},F}$ usually depends on the ratio $L/\mu$,
which is often called the condition number of $F$.

The catalyst acceleration is a general approach that wraps algorithm $\mathcal{M}$ into an accelerated
algorithm $\mathcal{A}$, which enjoys a faster linear convergence rate, with
$\tau_{\mathcal{A},F} \geq \tau_{\mathcal{M},F}$. As we will also see, the catalyst acceleration may also
be useful when $F$ is not strongly convex, that is, when $\mu = 0$. In that case, we may even consider
a method $\mathcal{M}$ that requires strong convexity to operate, and obtain an accelerated algorithm
$\mathcal{A}$ that can minimize $F$ with near-optimal convergence rate $\tilde{O}(1/k^2)$.²

Our approach can accelerate a wide range of first-order optimization algorithms, starting from
classical gradient descent. It also applies to randomized algorithms such as SAG, SAGA, SDCA, SVRG
and Finito/MISO, whose rates of convergence are given in expectation. Such methods should be
contrasted with stochastic gradient methods [12, 15], which minimize a different non-deterministic
function. Acceleration of stochastic gradient methods is beyond the scope of this work.

Catalyst action. We now highlight the mechanics of the catalyst algorithm, which is presented in
Algorithm 1. It consists of replacing, at iteration $k$, the original objective function $F$ by an
auxiliary objective $G_k$, close to $F$ up to a quadratic term:

\[ G_k(x) := F(x) + \frac{\kappa}{2} \|x - y_{k-1}\|^2, \tag{4} \]

where $\kappa$ will be specified later and $y_k$ is obtained by an extrapolation step described in (6).
Then, at iteration $k$, the accelerated algorithm $\mathcal{A}$ minimizes $G_k$ up to accuracy
$\varepsilon_k$.

Substituting (4) for (1) has two consequences. On the one hand, minimizing (4) only provides an
approximation of the solution of (1), unless $\kappa = 0$; on the other hand, the auxiliary objective
$G_k$ enjoys a better condition number than the original objective $F$, which makes it easier to
minimize. For instance, when $\mathcal{M}$ is the regular gradient descent algorithm with $\psi = 0$,
$\mathcal{M}$ has the rate of convergence (3) for minimizing $F$ with $\tau_{\mathcal{M},F} = \mu/L$.
However, owing to the additional quadratic term, $G_k$ can be minimized by $\mathcal{M}$ with the
rate (3) where $\tau_{\mathcal{M},G_k} = (\mu + \kappa)/(L + \kappa) > \tau_{\mathcal{M},F}$. In
practice, there exists an “optimal” choice for $\kappa$, which controls the time required by
$\mathcal{M}$ for solving the auxiliary problems (4), and the quality of approximation of $F$ by the
functions $G_k$. This choice will be driven by the convergence analysis in Sec. 3.1-3.3; see also
Sec. C for special cases.
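The effect of the quadratic term on the conditioning can be checked numerically. The sketch below evaluates the gradient descent rate parameter $(\mu + \kappa)/(L + \kappa)$ quoted above for a few values of $\kappa$; the constants `mu` and `L` are illustrative assumptions.

```python
def gd_rate(mu, L, kappa=0.0):
    """Rate parameter tau of gradient descent on F + (kappa/2)||x - y||^2,
    where F is mu-strongly convex with L-Lipschitz gradient."""
    return (mu + kappa) / (L + kappa)

mu, L = 1.0, 1000.0  # illustrative (assumed) constants
for kappa in (0.0, 10.0, 100.0, L - 2.0 * mu):
    # kappa = 0 recovers the rate mu/L of the original problem.
    print("kappa =", kappa, "tau =", gd_rate(mu, L, kappa))
```

A larger $\kappa$ makes each auxiliary problem easier, but $G_k$ then approximates $F$ less closely, which is the trade-off behind the “optimal” choice mentioned in the text.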

Acceleration via extrapolation and inexact minimization. Similar to the classical gradient
descent scheme of Nesterov [17], Algorithm 1 involves an extrapolation step (6). As a consequence, the
solution of the auxiliary problem (5) at iteration $k+1$ is driven towards the extrapolated variable
$y_k$. As shown in [9], this step is in fact sufficient to reduce the number of iterations of
Algorithm 1 to solve (1) when $\varepsilon_k = 0$, that is, for running the exact accelerated proximal
point algorithm.

Nevertheless, to control the total computational complexity of an accelerated algorithm $\mathcal{A}$,
it is necessary to take into account the complexity of solving the auxiliary problems (5) using
$\mathcal{M}$. This is where our approach differs from the classical proximal point algorithm of [9].
Essentially, both algorithms are the same, but we use the weaker inexactness criterion
$G_k(x_k) - G_k^* \leq \varepsilon_k$, where the sequence $(\varepsilon_k)_{k \geq 0}$ is fixed
beforehand, and only depends on the initial point. This subtle difference has important consequences:
(i) in practice, this condition can often be checked by computing duality gaps; (ii) in theory, the
methods $\mathcal{M}$ we consider have linear convergence rates, which allows us to control the
complexity of step (5), and then to provide the computational complexity of $\mathcal{A}$.

²In this paper, we use the notation $O(\cdot)$ to hide constants. The notation $\tilde{O}(\cdot)$ also hides logarithmic factors.

Algorithm 1 Catalyst

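input initial estimate $x_0 \in \mathbb{R}^p$, parameters $\kappa$ and $\alpha_0$, sequence $(\varepsilon_k)_{k \geq 0}$, optimization method $\mathcal{M}$;
1: Initialize $q = \mu/(\mu + \kappa)$ and $y_0 = x_0$;

2: while the desired stopping criterion is not satisfied do

3: Find an approximate solution of the following problem using $\mathcal{M}$

\[ x_k \approx \operatorname*{arg\,min}_{x \in \mathbb{R}^p} \Big\{ G_k(x) := F(x) + \frac{\kappa}{2} \|x - y_{k-1}\|^2 \Big\} \quad \text{such that} \quad G_k(x_k) - G_k^* \leq \varepsilon_k. \tag{5} \]

4: Compute $\alpha_k \in (0,1)$ from equation $\alpha_k^2 = (1 - \alpha_k)\alpha_{k-1}^2 + q\alpha_k$;

5: Compute

\[ y_k = x_k + \beta_k (x_k - x_{k-1}) \quad \text{with} \quad \beta_k = \frac{\alpha_{k-1}(1 - \alpha_{k-1})}{\alpha_{k-1}^2 + \alpha_k}. \tag{6} \]

6: end while

output $x_k$ (final estimate).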
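The steps of Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration under several assumptions that are not part of the paper: $\psi = 0$, plain gradient descent plays the role of $\mathcal{M}$, the stopping criterion of step 3 is enforced through the strong-convexity bound $G_k(x) - G_k^* \leq \|\nabla G_k(x)\|^2 / (2(\mu + \kappa))$, and the unknown gap $F(x_0) - F^*$ is replaced by an upper bound `eps0`. The function name `catalyst` and the toy quadratic are hypothetical.

```python
import math

def catalyst(grad_F, x0, mu, L, kappa, eps0, n_outer):
    """Sketch of Algorithm 1 for smooth F, with gradient descent as inner method."""
    q = mu / (mu + kappa)
    alpha = math.sqrt(q)          # alpha_0 = sqrt(q), the choice of Theorem 3.1
    rho = 0.9 * math.sqrt(q)      # any rho < sqrt(q) is allowed
    x_prev, y = list(x0), list(x0)
    step = 1.0 / (L + kappa)      # G_k has an (L + kappa)-Lipschitz gradient
    for k in range(1, n_outer + 1):
        eps_k = (2.0 / 9.0) * eps0 * (1.0 - rho) ** k
        x = list(x_prev)          # warm start the inner method at x_{k-1}
        while True:
            # Gradient of G_k(x) = F(x) + (kappa/2) ||x - y_{k-1}||^2.
            g = [gf + kappa * (xi - yi) for gf, xi, yi in zip(grad_F(x), x, y)]
            # Sufficient condition for G_k(x) - G_k^* <= eps_k.
            if sum(gi * gi for gi in g) <= 2.0 * (mu + kappa) * eps_k:
                break
            x = [xi - step * gi for xi, gi in zip(x, g)]
        # Step 4: positive root of a^2 + (alpha^2 - q) a - alpha^2 = 0.
        c = alpha * alpha - q
        alpha_new = 0.5 * (-c + math.sqrt(c * c + 4.0 * alpha * alpha))
        # Step 5: extrapolation (6).
        beta = alpha * (1.0 - alpha) / (alpha * alpha + alpha_new)
        y = [xi + beta * (xi - xp) for xi, xp in zip(x, x_prev)]
        x_prev, alpha = x, alpha_new
    return x_prev

# Toy problem (assumed): F(x) = sum_i (a_i / 2) x_i^2 - b_i x_i, minimized at b_i / a_i.
a = [1.0, 10.0, 100.0]
b = [1.0, 1.0, 1.0]
grad_F = lambda x: [ai * xi - bi for ai, xi, bi in zip(a, x, b)]
mu, L = min(a), max(a)
x0 = [0.0, 0.0, 0.0]
eps0 = sum(g * g for g in grad_F(x0)) / (2.0 * mu)  # upper bound on F(x0) - F*
x_hat = catalyst(grad_F, x0, mu, L, kappa=L - 2.0 * mu, eps0=eps0, n_outer=400)
print(x_hat)
```

In a composite or incremental setting, the inner loop would be replaced by the method being accelerated and the gradient-norm test by a duality gap or another certificate.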


3 Convergence Analysis 

In this section, we present the theoretical properties of Algorithm 1 for optimization methods
$\mathcal{M}$ with deterministic convergence rates of the form (3). When the rate is given as an
expectation, a simple extension of our analysis described in Section 4 is needed. For space limitation
reasons, we shall sketch the proof mechanics here, and defer the full proofs to Appendix B.


3.1 Analysis for $\mu$-Strongly Convex Objective Functions


We first analyze the convergence rate of Algorithm 1 for solving problem (1), regardless of the
complexity required to solve the subproblems (5). We start with the $\mu$-strongly convex case.

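Theorem 3.1 (Convergence of Algorithm 1, $\mu$-Strongly Convex Case).

Choose $\alpha_0 = \sqrt{q}$ with $q = \mu/(\mu + \kappa)$ and

\[ \varepsilon_k = \frac{2}{9} (F(x_0) - F^*)(1 - \rho)^k \quad \text{with} \quad \rho < \sqrt{q}. \]

Then, Algorithm 1 generates iterates $(x_k)_{k \geq 0}$ such that

\[ F(x_k) - F^* \leq C (1 - \rho)^{k+1} (F(x_0) - F^*) \quad \text{with} \quad C = \frac{8}{(\sqrt{q} - \rho)^2}. \tag{7} \]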


This theorem characterizes the linear convergence rate of Algorithm 1. It is worth noting that the
choice of $\rho$ is left to the discretion of the user, but it can safely be set to $\rho = 0.9\sqrt{q}$
in practice. The choice $\alpha_0 = \sqrt{q}$ was made for convenience purposes since it leads to a
simplified analysis, but larger values are also acceptable, both from theoretical and practical points
of view. Following advice from Nesterov [17, page 81] originally dedicated to his classical gradient
descent algorithm, we may for instance recommend choosing $\alpha_0$ such that
$\alpha_0^2 + (1 - q)\alpha_0 - 1 = 0$.

The choice of the sequence $(\varepsilon_k)_{k \geq 0}$ is also subject to discussion since the quantity
$F(x_0) - F^*$ is unknown beforehand. Nevertheless, an upper bound may be used instead, which will
only affect the corresponding constant in (7). Such upper bounds can typically be obtained by
computing a duality gap at $x_0$, or by using additional knowledge about the objective. For instance,
when $F$ is non-negative, we may simply choose $\varepsilon_k = (2/9) F(x_0) (1 - \rho)^k$.
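The parameter choices discussed above are straightforward to compute. The sketch below is only an illustration: the helper names `alpha0_nesterov` and `eps_schedule` are hypothetical, and `gap_upper_bound` stands for any available upper bound on $F(x_0) - F^*$ (for example $F(x_0)$ when $F$ is non-negative).

```python
import math

def alpha0_nesterov(q):
    """Positive root of a^2 + (1 - q) a - 1 = 0 (alternative choice of alpha_0)."""
    return 0.5 * (-(1.0 - q) + math.sqrt((1.0 - q) ** 2 + 4.0))

def eps_schedule(gap_upper_bound, q, k, rho_factor=0.9):
    """eps_k = (2/9) * bound * (1 - rho)^k with rho = rho_factor * sqrt(q)."""
    rho = rho_factor * math.sqrt(q)
    return (2.0 / 9.0) * gap_upper_bound * (1.0 - rho) ** k

q = 0.01  # illustrative (assumed) value of mu / (mu + kappa)
print(alpha0_nesterov(q), math.sqrt(q))
print([eps_schedule(1.5, q, k) for k in range(3)])
```

For $q < 1$ the root returned by `alpha0_nesterov` is larger than $\sqrt{q}$, in line with the remark that larger values of $\alpha_0$ are acceptable.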

The proof of convergence uses the concept of estimate sequence invented by Nesterov [17], and
introduces an extension to deal with the errors $(\varepsilon_k)_{k \geq 0}$. To control the
accumulation of errors, we borrow the methodology of [23] for inexact proximal gradient algorithms.
Our construction yields a convergence result that encompasses both strongly convex and non-strongly
convex cases. Note that estimate sequences were also used in [9], but, as noted by [22], the proof
of [9] only applies when using an extrapolation step (6) that involves the true minimizer of (5),
which is unknown in practice. To obtain a rigorous convergence result like (7), a different approach
was needed.
Theorem 3.1 is important, but it does not yet provide the global computational complexity of the full
algorithm, which includes the number of iterations performed by $\mathcal{M}$ for approximately solving
the auxiliary problems (5). The next proposition characterizes the complexity of this inner loop.

Proposition 3.2 (Inner-Loop Complexity, $\mu$-Strongly Convex Case).

Under the assumptions of Theorem 3.1, let us consider a method $\mathcal{M}$ generating iterates
$(z_t)_{t \geq 0}$ for minimizing the function $G_k$ with linear convergence rate of the form

\[ G_k(z_t) - G_k^* \leq A (1 - \tau_{\mathcal{M}})^t (G_k(z_0) - G_k^*). \tag{8} \]

When $z_0 = x_{k-1}$, the precision $\varepsilon_k$ is reached with a number of iterations
$T_{\mathcal{M}} = \tilde{O}(1/\tau_{\mathcal{M}})$, where the notation $\tilde{O}$ hides some universal
constants and some logarithmic dependencies in $\mu$ and $\kappa$.

This proposition is generic since the assumption (8) is relatively standard for gradient-based
methods [17]. It may now be used to obtain the global rate of convergence of an accelerated algorithm.
By calling $F_s$ the objective function value obtained after performing $s = k T_{\mathcal{M}}$
iterations of the method $\mathcal{M}$, the true convergence rate of the accelerated algorithm
$\mathcal{A}$ is

\[ F_s - F^* = F(x_{s/T_{\mathcal{M}}}) - F^* \leq C (1 - \rho)^{s/T_{\mathcal{M}}} (F(x_0) - F^*) \leq C \Big(1 - \frac{\rho}{T_{\mathcal{M}}}\Big)^{s} (F(x_0) - F^*). \tag{9} \]

As a result, algorithm $\mathcal{A}$ has a global linear rate of convergence with parameter

\[ \tau_{\mathcal{A},F} = \rho / T_{\mathcal{M}} = \tilde{O}\big( \tau_{\mathcal{M}} \sqrt{\mu} / \sqrt{\mu + \kappa} \big), \]

where $\tau_{\mathcal{M}}$ typically depends on $\kappa$ (the greater, the faster is $\mathcal{M}$).
Consequently, $\kappa$ will be chosen to maximize the ratio $\tau_{\mathcal{M}} / \sqrt{\mu + \kappa}$.
Note that for other algorithms $\mathcal{M}$ that do not satisfy (8), additional analysis and possibly
a different initialization $z_0$ may be necessary (see Appendix D for example).


3.2 Convergence Analysis for Convex but Non-Strongly Convex Objective Functions 


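We now state the convergence rate when the objective is not strongly convex, that is, when $\mu = 0$.

Theorem 3.3 (Convergence of Algorithm 1, Convex, but Non-Strongly Convex Case).

When $\mu = 0$, choose $\alpha_0 = (\sqrt{5} - 1)/2$ and

\[ \varepsilon_k = \frac{2 (F(x_0) - F^*)}{9 (k+2)^{4+\eta}} \quad \text{with} \quad \eta > 0. \tag{10} \]

Then, Algorithm 1 generates iterates $(x_k)_{k \geq 0}$ such that

\[ F(x_k) - F^* \leq \frac{8}{(k+2)^2} \Big( \big(1 + \tfrac{2}{\eta}\big)^2 (F(x_0) - F^*) + \frac{\kappa}{2} \|x_0 - x^*\|^2 \Big). \tag{11} \]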


This theorem is the counterpart of Theorem 3.1 when $\mu = 0$. The choice of $\eta$ is left to the
discretion of the user; it empirically seems to have very low influence on the global convergence
speed, as long as it is chosen small enough (e.g., we use $\eta = 0.1$ in practice). It shows that
Algorithm 1 achieves the optimal rate of convergence of first-order methods, but it does not take into
account the complexity of solving the subproblems (5). Therefore, we need the following proposition:

Proposition 3.4 (Inner-Loop Complexity, Non-Strongly Convex Case).

Assume that $F$ has bounded level sets. Under the assumptions of Theorem 3.3, let us consider a
method $\mathcal{M}$ generating iterates $(z_t)_{t \geq 0}$ for minimizing the function $G_k$ with
linear convergence rate of the form (8). Then, there exists
$T_{\mathcal{M}} = \tilde{O}(1/\tau_{\mathcal{M}})$, such that for any $k \geq 1$, solving $G_k$ with
initial point $x_{k-1}$ requires at most $T_{\mathcal{M}} \log(k+2)$ iterations of $\mathcal{M}$.


We can now draw up the global complexity of an accelerated algorithm $\mathcal{A}$ when $\mathcal{M}$
has a linear convergence rate (8) for $\kappa$-strongly convex objectives. To produce $x_k$,
$\mathcal{M}$ is called at most $k T_{\mathcal{M}} \log(k+2)$ times. Using the global iteration counter
$s = k T_{\mathcal{M}} \log(k+2)$, we get

\[ F_s - F^* \leq \frac{8 T_{\mathcal{M}}^2 \log^2(s)}{s^2} \Big( \big(1 + \tfrac{2}{\eta}\big)^2 (F(x_0) - F^*) + \frac{\kappa}{2} \|x_0 - x^*\|^2 \Big). \tag{12} \]

If $\mathcal{M}$ is a first-order method, this rate is near-optimal, up to a logarithmic factor, when
compared to the optimal rate $O(1/s^2)$, which may be the price to pay for using a generic
acceleration scheme.
4 Acceleration in Practice


We show here how to accelerate existing algorithms $\mathcal{M}$ and compare the convergence rates
obtained before and after catalyst acceleration. For all the algorithms we consider, we study rates of
convergence in terms of total number of iterations (in expectation, when necessary) to reach accuracy
$\varepsilon$. We first show how to accelerate full gradient and randomized coordinate descent
algorithms [21]. Then, we discuss other approaches such as SAG [24], SAGA [6], or SVRG [27]. Finally,
we present a new proximal version of the incremental gradient approaches Finito/MISO [7, 14], along
with its accelerated version. Table 1 summarizes the acceleration obtained for the algorithms
considered.

Deriving the global rate of convergence. The convergence rate of an accelerated algorithm
$\mathcal{A}$ is driven by the parameter $\kappa$. In the strongly convex case, the best choice is the
one that maximizes the ratio $\tau_{\mathcal{M},G_k} / \sqrt{\mu + \kappa}$. As discussed in Appendix C,
this rule also holds when (8) is given in expectation and in many cases where the constant
$C_{\mathcal{M},G_k}$ is different than $A(G_k(z_0) - G_k^*)$ from (8).

When $\mu = 0$, the choice of $\kappa > 0$ only affects the complexity by a multiplicative constant.
A rule of thumb is to maximize the ratio $\tau_{\mathcal{M},G_k} / \sqrt{L + \kappa}$ (see Appendix C
for more details).

After choosing $\kappa$, the global iteration-complexity is given by
$\text{Comp} \leq k_{\text{in}} k_{\text{out}}$, where $k_{\text{in}}$ is an upper bound on the number
of iterations performed by $\mathcal{M}$ per inner loop, and $k_{\text{out}}$ is the upper bound on the
number of outer-loop iterations, following from Theorems 3.1-3.3. Note that for simplicity, we always
consider that $L \gg \mu$ such that we may write $L - \mu$ simply as “$L$” in the convergence rates.

4.1 Acceleration of Existing Algorithms 

Composite minimization. Most of the algorithms we consider here, namely the proximal gradient
method [4, 19], SAGA [6], (Prox)-SVRG [27], can handle composite objectives with a regularization
penalty $\psi$ that admits a proximal operator $\operatorname{prox}_\psi$, defined for any $z$ as

\[ \operatorname{prox}_\psi(z) := \operatorname*{arg\,min}_{y \in \mathbb{R}^p} \Big\{ \psi(y) + \frac{1}{2} \|y - z\|^2 \Big\}. \]

Table 1 presents convergence rates that are valid for proximal and non-proximal settings, since
most methods we consider are able to deal with such non-differentiable penalties. The exception is
SAG [24], for which proximal variants are not analyzed. The incremental method Finito/MISO has
also been limited to non-proximal settings so far. In Section 4.2, we actually introduce the extension
of MISO to composite minimization, and establish its theoretical convergence rates.
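As a concrete instance of a proximal operator, consider the penalty $\psi(y) = \lambda \|y\|_1$, for which the minimization in the definition separates across coordinates and has the well-known closed form called soft-thresholding. The sketch below is an illustration only; the function name `prox_l1` and the parameter `lam` are hypothetical and not part of the paper.

```python
import math

def prox_l1(z, lam):
    """Proximal operator of psi(y) = lam * ||y||_1 at z (soft-thresholding).

    Each coordinate is shrunk towards zero by lam and set to zero
    when its magnitude is at most lam.
    """
    return [math.copysign(max(abs(zi) - lam, 0.0), zi) for zi in z]

print(prox_l1([3.0, -0.5, 1.0], 1.0))
```

Penalties whose proximal operator is this cheap are the ones for which the proximal variants listed above are practical.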

Full gradient method. A first illustration is the algorithm obtained when accelerating the regular
“full” gradient descent (FG), and how it contrasts with Nesterov’s accelerated variant (AFG). Here,
the optimal choice for $\kappa$ is $L - 2\mu$. In the strongly convex case, we get an accelerated rate
of convergence in $\tilde{O}(n \sqrt{L/\mu} \log(1/\varepsilon))$, which is the same as AFG up to
logarithmic terms. A similar result can also be obtained for randomized coordinate descent
methods [21].
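The value $\kappa = L - 2\mu$ can be recovered with a short calculation. The sketch below assumes the gradient descent rate $\tau_{\mathcal{M},G_k} = (\mu + \kappa)/(L + \kappa)$ quoted in Section 2 and the rule of maximizing $\tau_{\mathcal{M},G_k}/\sqrt{\mu + \kappa}$; the substitution $u = \mu + \kappa$ is introduced here for convenience, and $L \geq 2\mu$ is assumed so that the maximizer satisfies $\kappa \geq 0$.

```latex
\begin{align*}
\frac{\tau_{\mathcal{M},G_k}}{\sqrt{\mu+\kappa}}
  &= \frac{\sqrt{\mu+\kappa}}{L+\kappa}
   = \frac{\sqrt{u}}{u + (L-\mu)}, \qquad u := \mu + \kappa, \\
\frac{d}{du}\,\frac{\sqrt{u}}{u + (L-\mu)}
  &= \frac{(L-\mu) - u}{2\sqrt{u}\,\bigl(u + (L-\mu)\bigr)^2}.
\end{align*}
% The derivative is positive for u < L - mu and negative for u > L - mu,
% so the ratio is maximized at u = L - mu, that is, kappa = L - 2 mu.
% At this value the ratio equals 1 / (2 sqrt(L - mu)).
```

Plugging this value into $\tau_{\mathcal{A},F} = \tilde{O}(\tau_{\mathcal{M}} \sqrt{\mu}/\sqrt{\mu + \kappa})$ gives a rate of order $\sqrt{\mu/(L - \mu)}$, consistent with the $\sqrt{L/\mu}$ dependence stated above.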

Randomized incremental gradient. We now consider randomized incremental gradient methods,
resp. SAG [24] and SAGA [6]. When $\mu > 0$, we focus on the “ill-conditioned” setting
$n \leq L/\mu$, where these methods have the complexity $O((L/\mu) \log(1/\varepsilon))$. Otherwise,
their complexity becomes $O(n \log(1/\varepsilon))$, which is independent of the condition number and
seems theoretically optimal [1].

For these methods, the best choice for $\kappa$ has the form $\kappa = a(L - \mu)/(n + b) - \mu$, with
$(a, b) = (2, -2)$ for SAG, $(a, b) = (1/2, 1/2)$ for SAGA. A similar formula, with a constant $L'$ in
place of $L$, holds for SVRG; we omit it here for brevity. SDCA [25] and Finito/MISO [7, 14] are
actually related to incremental gradient methods, and the choice for $\kappa$ has a similar form with
$(a, b) = (1, 1)$.
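The rule for $\kappa$ stated above is easy to tabulate. The following sketch encodes the $(a, b)$ pairs given in the text; the function name `catalyst_kappa`, the dictionary keys, and the numerical values are illustrative assumptions, and SVRG is left out because its formula is omitted in the text.

```python
# (a, b) pairs quoted in the text for kappa = a (L - mu) / (n + b) - mu.
KAPPA_COEFFS = {
    "SAG": (2.0, -2.0),
    "SAGA": (0.5, 0.5),
    "SDCA": (1.0, 1.0),
    "MISO": (1.0, 1.0),
}

def catalyst_kappa(L, mu, n, method):
    """Suggested kappa for a given incremental method."""
    a, b = KAPPA_COEFFS[method]
    return a * (L - mu) / (n + b) - mu

# Illustrative (assumed) constants in the ill-conditioned regime n <= L / mu.
L, mu, n = 97.0, 1.0, 10
for name in KAPPA_COEFFS:
    print(name, catalyst_kappa(L, mu, n, name))
```

Outside the regime $n \leq L/\mu$ the formula can return a negative value, which matches the remark that no acceleration is expected there.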

4.2 Proximal MISO and its Acceleration 

Finito/MISO was proposed in [7] and [14] for solving the problem (1) when $\psi = 0$ and when $f$ is
a sum of $n$ $\mu$-strongly convex functions $f_i$ as in (2), which are also differentiable with
$L$-Lipschitz derivatives. The algorithm maintains a list of quadratic lower bounds, say
$(d_i^k)_{i=1}^{n}$ at iteration $k$, of the functions $f_i$ and randomly updates one of them at each
iteration by using strong-convexity
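Method           | Comp. $\mu > 0$                                      | Comp. $\mu = 0$              | Catalyst $\mu > 0$                                     | Catalyst $\mu = 0$
FG               | $O(n (L/\mu) \log(1/\varepsilon))$                   | $O(n L/\varepsilon)$         | $\tilde{O}(n \sqrt{L/\mu} \log(1/\varepsilon))$       | $\tilde{O}(n L/\sqrt{\varepsilon})$
SAG [24]         | $O((L/\mu) \log(1/\varepsilon))$                     | [illegible]                  | $\tilde{O}(\sqrt{nL/\mu} \log(1/\varepsilon))$        | [illegible]
SAGA [6]         | $O((L/\mu) \log(1/\varepsilon))$                     | [illegible]                  | $\tilde{O}(\sqrt{nL/\mu} \log(1/\varepsilon))$        | [illegible]
Finito/MISO-Prox | $O((L/\mu) \log(1/\varepsilon))$                     | not avail.                   | $\tilde{O}(\sqrt{nL/\mu} \log(1/\varepsilon))$        | [illegible]
SDCA [25]        | $O((L/\mu) \log(1/\varepsilon))$                     | not avail.                   | $\tilde{O}(\sqrt{nL/\mu} \log(1/\varepsilon))$        | [illegible]
SVRG [27]        | $O((L'/\mu) \log(1/\varepsilon))$                    | not avail.                   | $\tilde{O}(\sqrt{nL'/\mu} \log(1/\varepsilon))$       | [illegible]
Acc-FG [19]      | $O(n \sqrt{L/\mu} \log(1/\varepsilon))$              | $O(n L/\sqrt{\varepsilon})$  | no acceleration                                        | no acceleration
Acc-SDCA [26]    | $\tilde{O}(\sqrt{nL/\mu} \log(1/\varepsilon))$       | not avail.                   | no acceleration                                        | no acceleration

Table 1: Comparison of rates of convergence, before and after the catalyst acceleration, resp. in
the strongly-convex and non strongly-convex cases. To simplify, we only present the case where
$n \leq L/\mu$ when $\mu > 0$. For all incremental algorithms, there is indeed no acceleration otherwise.
The quantity $L'$ for SVRG is the average Lipschitz constant of the functions $f_i$ (see [27]).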


inequalities. The current iterate $x_k$ is then obtained by minimizing the lower bound of the objective

\[ x_k = \operatorname*{arg\,min}_{x \in \mathbb{R}^p} \Big\{ D_k(x) = \frac{1}{n} \sum_{i=1}^{n} d_i^k(x) \Big\}. \tag{13} \]

Interestingly, since $D_k$ is a lower bound of $F$ we also have $D_k(x_k) \leq F^*$, and thus the
quantity $F(x_k) - D_k(x_k)$ can be used as an optimality certificate that upper-bounds
$F(x_k) - F^*$. Furthermore, this certificate was shown to converge to zero with a rate similar to
SAG/SDCA/SVRG/SAGA under the condition $n \geq 2L/\mu$. In this section, we show how to remove this
condition and how to provide support to non-differentiable functions $\psi$ whose proximal operator
can be easily computed. We shall briefly sketch the main ideas, and we refer to Appendix D for a
thorough presentation.


The first idea to deal with a nonsmooth regularizer $\psi$ is to change the definition of $D_k$:

\[ D_k(x) = \frac{1}{n} \sum_{i=1}^{n} d_i^k(x) + \psi(x), \]

which was also proposed in [7] without a convergence proof. Then, because the $d_i^k$'s are quadratic
functions, the minimizer $x_k$ of $D_k$ can be obtained by computing the proximal operator of $\psi$
at a particular point. The second idea to remove the condition $n \geq 2L/\mu$ is to modify the update
of the lower bounds $d_i^k$. Assume that index $i_k$ is selected among $\{1, \ldots, n\}$ at
iteration $k$, then

\[ d_i^k(x) = \begin{cases} (1 - \delta) d_i^{k-1}(x) + \delta \big( f_i(x_{k-1}) + \langle \nabla f_i(x_{k-1}), x - x_{k-1} \rangle + \frac{\mu}{2} \|x - x_{k-1}\|^2 \big) & \text{if } i = i_k, \\ d_i^{k-1}(x) & \text{otherwise.} \end{cases} \]

Whereas the original Finito/MISO uses $\delta = 1$, our new variant uses
$\delta = \min(1, \mu n / (2(L - \mu)))$. The resulting algorithm turns out to be very close to
variant “5” of proximal SDCA [25], which corresponds to using a different value for $\delta$. The main
difference between SDCA and MISO-Prox is that the latter does not use duality. It also provides a
different (simpler) optimality certificate $F(x_k) - D_k(x_k)$, which is guaranteed to converge
linearly, as stated in the next theorem.
Theorem 4.1 (Convergence of MISO-Prox).

Let $(x_k)_{k \geq 0}$ be obtained by MISO-Prox, then

\[ \mathbb{E}[F(x_k)] - F^* \leq \frac{1}{\tau} (1 - \tau)^{k+1} (F(x_0) - D_0(x_0)) \quad \text{with} \quad \tau \geq \min\Big\{ \frac{\mu}{4L}, \frac{1}{2n} \Big\}. \tag{14} \]

Furthermore, we also have fast convergence of the certificate

\[ \mathbb{E}[F(x_k) - D_k(x_k)] \leq \frac{1}{\tau} (1 - \tau)^{k} (F^* - D_0(x_0)). \]


The proof of convergence is given in Appendix D. Finally, we conclude this section by noting that
MISO-Prox enjoys the catalyst acceleration, leading to the iteration-complexity presented in
Table 1. Since the convergence rate (14) does not have exactly the same form as (8),
Propositions 3.2 and 3.4 cannot be used and additional analysis, given in Appendix D, is needed.
Practical forms of the algorithm are also presented there, along with discussions on how to
initialize it.
5 Experiments


We evaluate the Catalyst acceleration on three methods that have never been accelerated in the past:
SAG [24], SAGA [6], and MISO-Prox. We focus on $\ell_2$-regularized logistic regression, where the
regularization parameter $\mu$ yields a lower bound on the strong convexity parameter of the problem.

We use three datasets used in [14], namely real-sim, rcv1, and ocr, which are relatively large, with
up to $n = 2\,500\,000$ points for ocr and $p = 47\,152$ variables for rcv1. We consider three regimes:
$\mu = 0$ (no regularization), $\mu/L = 0.001/n$ and $\mu/L = 0.1/n$, which lead to significantly
larger condition numbers than those used in other studies ($\mu/L \approx 1/n$ in [14, 24]). We compare
MISO, SAG, and SAGA with their default parameters, which are recommended by their theoretical analysis
(step-sizes $1/L$ for SAG and $1/3L$ for SAGA), and study several accelerated variants. The values of
$\kappa$ and $\rho$ and the sequences $(\varepsilon_k)_{k \geq 0}$ are those suggested in the previous
sections, with $\eta = 0.1$ in (10). Other implementation details are presented in Appendix E.


The restarting strategy for $\mathcal{M}$ is key to achieving acceleration in practice. All of the
methods we compare store $n$ gradients evaluated at previous iterates of the algorithm. We always use
the gradients from the previous run of $\mathcal{M}$ to initialize a new one. We detail in Appendix E
the initialization for each method. Finally, we evaluated a heuristic that constrains $\mathcal{M}$ to
always perform at most $n$ iterations (one pass over the data); we call this variant AMISO2 for MISO
whereas AMISO1 refers to the regular “vanilla” accelerated variant, and we also use this heuristic to
accelerate SAG.

The results are reported in Table 1. We always obtain a huge speed-up for MISO, which suffers from
numerical stability issues when the condition number is very large (for instance,
$\mu/L = 10^{-3}/n = 4 \cdot 10^{-10}$ for ocr). Here, not only does the catalyst algorithm accelerate
MISO, but it also stabilizes it. Whereas MISO is slower than SAG and SAGA in this “small $\mu$”
regime, AMISO2 is almost systematically the best performer. We are also able to accelerate SAG and
SAGA in general, even though the improvement is less significant than for MISO. In particular, SAGA
without acceleration proves to be the best method on ocr. One reason may be its ability to adapt to
the unknown strong convexity parameter $\mu' \geq \mu$ of the objective near the solution. When
$\mu'/L \geq 1/n$, we indeed obtain a regime where acceleration does not occur (see Sec. 4).
Therefore, this experiment suggests that adaptivity to unknown strong convexity is of high interest
for incremental optimization.





Figure 1: Objective function value (or duality gap) for different number of passes performed over
each dataset. The legend for all curves is on the top right. AMISO, ASAGA, ASAG refer to the
accelerated variants of MISO, SAGA, and SAG, respectively.
In this appendix, Section A is devoted to the construction of an object called estimate sequence,
originally introduced by Nesterov (see [17]), and introduces extensions to deal with inexact
minimization. This section contains a generic convergence result that will be used to prove the main
theorems and propositions of the paper in Section B. Then, Section C is devoted to the computation
of global convergence rates of accelerated algorithms. Section D presents in detail the proximal
MISO algorithm, and Section E gives some implementation details of the experiments.


A Construction of the Approximate Estimate Sequence 


The estimate sequence is a generic tool introduced by Nesterov for proving the convergence of
accelerated gradient-based algorithms. We start by recalling the definition given in [17].

Definition A.1 (Estimate Sequence [17]).

A pair of sequences $(\varphi_k)_{k \geq 0}$ and $(\lambda_k)_{k \geq 0}$, with $\lambda_k \geq 0$ and
$\varphi_k : \mathbb{R}^p \to \mathbb{R}$, is called an estimate sequence of function $F$ if

\[ \lambda_k \to 0, \]

and for any $x$ in $\mathbb{R}^p$ and all $k \geq 0$, we have

\[ \varphi_k(x) \leq (1 - \lambda_k) F(x) + \lambda_k \varphi_0(x). \]


Estimate sequences are used for proving convergence rates thanks to the following lemma.

Lemma A.2 (Lemma 2.2.1 from [17]).

If for some sequence $(x_k)_{k \geq 0}$ we have

\[ F(x_k) \leq \varphi_k^* := \min_{x \in \mathbb{R}^p} \varphi_k(x), \]

for an estimate sequence $(\varphi_k)_{k \geq 0}$ of $F$, then

\[ F(x_k) - F^* \leq \lambda_k (\varphi_0(x^*) - F^*), \]

where $x^*$ is a minimizer of $F$.
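The lemma follows from a short chain of inequalities. The sketch below uses only the assumption of Lemma A.2, then Definition A.1 evaluated at $x = x^*$, and finally $F(x^*) = F^*$.

```latex
\begin{align*}
F(x_k) \;\le\; \varphi_k^* \;=\; \min_{x \in \mathbb{R}^p} \varphi_k(x)
  &\le \varphi_k(x^*) \\
  &\le (1 - \lambda_k) F(x^*) + \lambda_k \varphi_0(x^*) \\
  &= F^* + \lambda_k \bigl( \varphi_0(x^*) - F^* \bigr).
\end{align*}
```

Subtracting $F^*$ from both ends gives the stated bound.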


The rate of convergence of $F(x_k)$ is thus directly related to the convergence rate of $\lambda_k$.
Constructing estimate sequences is thus appealing, even though finding the most appropriate one is
not trivial for the catalyst algorithm because of the approximate minimization of $G_k$ in (5). In a
nutshell, the main steps of our convergence analysis are


1. define an “approximate” estimate sequence for $F$ corresponding to Algorithm 1, that is,
find a function $\varphi_k$ that almost satisfies Definition A.1 up to the approximation errors
$\varepsilon_k$ made in (5) when minimizing $G_k$, and control the way these errors sum up together.

2. extend Lemma A.2 to deal with the approximation errors $\varepsilon_k$ to derive a generic
convergence rate for the sequence $(x_k)_{k \geq 0}$.


This is also the strategy proposed by Güler in [9] for his inexact accelerated proximal point
algorithm, which essentially differs from ours in its stopping criterion. The estimate sequence we
choose is also different and leads to a more rigorous convergence proof. Specifically, we prove in
this section the following theorem:


Theorem A.3 (Convergence Result Derived from an Approximate Estimate Sequence). 

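Let us denote

$$\lambda_k = \prod_{i=0}^{k-1} (1 - \alpha_i), \tag{15}$$

where the $\alpha_i$'s are defined in Algorithm 1. Then, the sequence $(x_k)_{k \ge 0}$ satisfies

$$F(x_k) - F^* \le \lambda_k \left( \sqrt{S_k} + 2 \sum_{i=1}^{k} \sqrt{\frac{\varepsilon_i}{\lambda_i}} \right)^2, \tag{16}$$

where $F^*$ is the minimum value of $F$ and

$$S_k = F(x_0) - F^* + \frac{\gamma_0}{2} \|x_0 - x^*\|^2 + \sum_{i=1}^{k} \frac{\varepsilon_i}{\lambda_i} \quad \text{where} \quad \gamma_0 = \frac{\alpha_0 ((\kappa + \mu)\alpha_0 - \mu)}{1 - \alpha_0}, \tag{17}$$

where $x^*$ is a minimizer of $F$.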
Then, the theorem will be used with the following lemma from [17] to control the convergence rate
of the sequence $(\lambda_k)_{k \ge 0}$, whose definition follows the classical use of estimate sequences [17]. This
will provide us convergence rates both for the strongly convex and non-strongly convex cases.

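Lemma A.4 (Lemma 2.2.4 from [17]).

If the quantity $\gamma_0$ defined in (17) satisfies $\gamma_0 \ge \mu$, then the sequence $(\lambda_k)_{k \ge 0}$ from (15) satisfies

$$\lambda_k \le \min \left\{ (1 - \sqrt{q})^k, \; \frac{4}{\left(2 + k \sqrt{\gamma_0 / (\kappa + \mu)}\right)^2} \right\}. \tag{18}$$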
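As a quick numerical sanity check of the bound (18), the sketch below iterates the recursion $\alpha_k^2 = (1 - \alpha_k)\alpha_{k-1}^2 + q\alpha_k$ from Algorithm 1 and accumulates $\lambda_k$ as in (15). The function names (`next_alpha`, `lambdas`, `bound_18`, `gamma0_of`) and any numerical values passed to them are illustrative assumptions, not part of the analysis.

```python
import math

def next_alpha(alpha_prev: float, q: float) -> float:
    """Positive root a of a^2 = (1 - a) * alpha_prev^2 + q * a."""
    b = alpha_prev ** 2 - q
    return 0.5 * (-b + math.sqrt(b * b + 4.0 * alpha_prev ** 2))

def lambdas(alpha0: float, q: float, n: int) -> list:
    """Return [lambda_0, ..., lambda_n] with lambda_k = prod_{i<k} (1 - alpha_i)."""
    out, lam, alpha = [1.0], 1.0, alpha0
    for _ in range(n):
        lam *= 1.0 - alpha          # multiply by (1 - alpha_i)
        out.append(lam)
        alpha = next_alpha(alpha, q)
    return out

def gamma0_of(alpha0: float, kappa: float, mu: float) -> float:
    """gamma_0 as defined in (17)."""
    return alpha0 * ((kappa + mu) * alpha0 - mu) / (1.0 - alpha0)

def bound_18(k: int, q: float, gamma0: float, kappa: float, mu: float) -> float:
    """Right-hand side of (18)."""
    return min((1.0 - math.sqrt(q)) ** k,
               4.0 / (2.0 + k * math.sqrt(gamma0 / (kappa + mu))) ** 2)
```

With $\alpha_0 = \sqrt{q}$ the recursion is stationary and $\lambda_k = (1-\sqrt{q})^k$; with $\mu = 0$ and $\alpha_0 = (\sqrt{5}-1)/2$ one gets $\gamma_0 = \kappa$ and the bound reduces to $4/(k+2)^2$.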


We may now move to the proof of the theorem. 

A.1 Proof of Theorem A.3

The first step to construct an estimate sequence is typically to find a sequence of lower bounds
of $F$. By calling $x_k^*$ the minimizer of $G_k$, the following one is used in [9]:

Lemma A.5 (Lower Bound for $F$ near $x_k^*$).

For all $x$ in $\mathbb{R}^p$,

$$F(x) \ge F(x_k^*) + \langle \kappa (y_{k-1} - x_k^*), x - x_k^* \rangle + \frac{\mu}{2} \|x - x_k^*\|^2. \tag{19}$$

Proof. By strong convexity, $G_k(x) \ge G_k(x_k^*) + \frac{\kappa + \mu}{2} \|x - x_k^*\|^2$, which is equivalent to

$$F(x) + \frac{\kappa}{2} \|x - y_{k-1}\|^2 \ge F(x_k^*) + \frac{\kappa}{2} \|x_k^* - y_{k-1}\|^2 + \frac{\kappa + \mu}{2} \|x - x_k^*\|^2.$$

After developing the quadratic terms, we directly obtain (19). □


Unfortunately, the exact value $x_k^*$ is unknown in practice and the estimate sequence of [9] yields
in fact an algorithm where the definition of the anchor point $y_k$ involves the unknown quantity $x_k^*$
instead of the approximate solutions $x_k$ and $x_{k-1}$ as in (6), as also noted by others [22]. To obtain a
rigorous proof of convergence for Algorithm 1, it is thus necessary to refine the analysis of [9]. To
that effect, we construct below a sequence of functions that approximately satisfies the definition of
estimate sequences. Essentially, we replace in (19) the quantity $x_k^*$ by $x_k$ to obtain an approximate
lower bound, and control the error by using the condition $G_k(x_k) - G_k(x_k^*) \le \varepsilon_k$. This leads us to
the following construction:

1. $\phi_0(x) = F(x_0) + \frac{\gamma_0}{2} \|x - x_0\|^2$;

2. For $k \ge 1$, we set

$$\phi_k(x) = (1 - \alpha_{k-1}) \phi_{k-1}(x) + \alpha_{k-1} \left[ F(x_k) + \langle \kappa (y_{k-1} - x_k), x - x_k \rangle + \frac{\mu}{2} \|x - x_k\|^2 \right],$$

where the value of $\gamma_0$, given in (17), will be explained later. Note that if one replaces $x_k$ by $x_k^*$ in
the above construction, it is easy to show that $(\phi_k)_{k \ge 0}$ would be exactly an estimate sequence for $F$
with the relation $\lambda_k$ given in (15).

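Before extending Lemma A.2 to deal with the approximate sequence and conclude the proof of
the theorem, we need to characterize a few properties of the sequence $(\phi_k)_{k \ge 0}$. In particular, the
functions $\phi_k$ are quadratic and admit a canonical form:

Lemma A.6 (Canonical Form of the Functions $\phi_k$).

For all $k \ge 0$, $\phi_k$ can be written in the canonical form

$$\phi_k(x) = \phi_k^* + \frac{\gamma_k}{2} \|x - v_k\|^2,$$

where the sequences $(\gamma_k)_{k \ge 0}$, $(v_k)_{k \ge 0}$, and $(\phi_k^*)_{k \ge 0}$ are defined as follows:

$$\gamma_k = (1 - \alpha_{k-1}) \gamma_{k-1} + \alpha_{k-1} \mu, \tag{20}$$

$$v_k = \frac{1}{\gamma_k} \left( (1 - \alpha_{k-1}) \gamma_{k-1} v_{k-1} + \alpha_{k-1} \mu x_k - \alpha_{k-1} \kappa (y_{k-1} - x_k) \right), \tag{21}$$

$$\phi_k^* = (1 - \alpha_{k-1}) \phi_{k-1}^* + \alpha_{k-1} F(x_k) - \frac{\alpha_{k-1}^2}{2 \gamma_k} \|\kappa (y_{k-1} - x_k)\|^2 + \frac{\alpha_{k-1} (1 - \alpha_{k-1}) \gamma_{k-1}}{\gamma_k} \left( \frac{\mu}{2} \|x_k - v_{k-1}\|^2 + \langle \kappa (y_{k-1} - x_k), v_{k-1} - x_k \rangle \right). \tag{22}$$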



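Proof. We have for all $k \ge 1$ and all $x$ in $\mathbb{R}^p$,

$$\phi_k(x) = (1 - \alpha_{k-1}) \left( \phi_{k-1}^* + \frac{\gamma_{k-1}}{2} \|x - v_{k-1}\|^2 \right) + \alpha_{k-1} \left( F(x_k) + \langle \kappa (y_{k-1} - x_k), x - x_k \rangle + \frac{\mu}{2} \|x - x_k\|^2 \right) = \phi_k^* + \frac{\gamma_k}{2} \|x - v_k\|^2. \tag{23}$$

Differentiating twice the relations (23) gives us directly (20). Since $v_k$ minimizes $\phi_k$, the optimality
condition $\nabla \phi_k(v_k) = 0$ gives

$$(1 - \alpha_{k-1}) \gamma_{k-1} (v_k - v_{k-1}) + \alpha_{k-1} \left( \kappa (y_{k-1} - x_k) + \mu (v_k - x_k) \right) = 0,$$

and then we obtain (21). Finally, apply $x = x_k$ to (23), which yields

$$\phi_k(x_k) = (1 - \alpha_{k-1}) \left( \phi_{k-1}^* + \frac{\gamma_{k-1}}{2} \|x_k - v_{k-1}\|^2 \right) + \alpha_{k-1} F(x_k) = \phi_k^* + \frac{\gamma_k}{2} \|x_k - v_k\|^2.$$

Consequently,

$$\phi_k^* = (1 - \alpha_{k-1}) \phi_{k-1}^* + \alpha_{k-1} F(x_k) + (1 - \alpha_{k-1}) \frac{\gamma_{k-1}}{2} \|x_k - v_{k-1}\|^2 - \frac{\gamma_k}{2} \|x_k - v_k\|^2. \tag{24}$$

Using the expression of $v_k$ from (21), we have

$$v_k - x_k = \frac{1}{\gamma_k} \left( (1 - \alpha_{k-1}) \gamma_{k-1} (v_{k-1} - x_k) - \alpha_{k-1} \kappa (y_{k-1} - x_k) \right),$$

and thus

$$\frac{\gamma_k}{2} \|x_k - v_k\|^2 = \frac{(1 - \alpha_{k-1})^2 \gamma_{k-1}^2}{2 \gamma_k} \|x_k - v_{k-1}\|^2 - \frac{(1 - \alpha_{k-1}) \alpha_{k-1} \gamma_{k-1}}{\gamma_k} \langle v_{k-1} - x_k, \kappa (y_{k-1} - x_k) \rangle + \frac{\alpha_{k-1}^2}{2 \gamma_k} \|\kappa (y_{k-1} - x_k)\|^2.$$

It remains to plug this relation into (24), use once (20), and we obtain the formula (22) for $\phi_k^*$. □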

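We may now start analyzing the errors $\varepsilon_k$ to control how far the sequence $(\phi_k)_{k \ge 0}$ is from an exact
estimate sequence. For that, we need to understand the effect of replacing $x_k^*$ by $x_k$ in the lower
bound (19). The following lemma will be useful for that purpose.

Lemma A.7 (Controlling the Approximate Lower Bound of $F$).

If $G_k(x_k) - G_k(x_k^*) \le \varepsilon_k$, then for all $x$ in $\mathbb{R}^p$,

$$F(x) \ge F(x_k) + \langle \kappa (y_{k-1} - x_k), x - x_k \rangle + \frac{\mu}{2} \|x - x_k\|^2 + (\kappa + \mu) \langle x_k - x_k^*, x - x_k \rangle - \varepsilon_k. \tag{25}$$

Proof. By strong convexity, for all $x$ in $\mathbb{R}^p$,

$$G_k(x) \ge G_k^* + \frac{\kappa + \mu}{2} \|x - x_k^*\|^2,$$

where $G_k^*$ is the minimum value of $G_k$. Replacing $G_k$ by its definition (5) gives

$$\begin{aligned}
F(x) &\ge G_k^* + \frac{\kappa + \mu}{2} \|x - x_k^*\|^2 - \frac{\kappa}{2} \|x - y_{k-1}\|^2 \\
&= G_k(x_k) + (G_k^* - G_k(x_k)) + \frac{\kappa + \mu}{2} \|x - x_k^*\|^2 - \frac{\kappa}{2} \|x - y_{k-1}\|^2 \\
&\ge G_k(x_k) - \varepsilon_k + \frac{\kappa + \mu}{2} \|(x - x_k) + (x_k - x_k^*)\|^2 - \frac{\kappa}{2} \|x - y_{k-1}\|^2 \\
&\ge G_k(x_k) - \varepsilon_k + \frac{\kappa + \mu}{2} \|x - x_k\|^2 - \frac{\kappa}{2} \|x - y_{k-1}\|^2 + (\kappa + \mu) \langle x_k - x_k^*, x - x_k \rangle.
\end{aligned}$$

We conclude by noting that

$$G_k(x_k) + \frac{\kappa}{2} \|x - x_k\|^2 - \frac{\kappa}{2} \|x - y_{k-1}\|^2 = F(x_k) + \langle \kappa (y_{k-1} - x_k), x - x_k \rangle. \quad □$$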

We can now show that Algorithm 1 generates iterates $(x_k)_{k \ge 0}$ that approximately satisfy the
condition of Lemma A.2 from Nesterov [17].

Lemma A.8 (Relation between $(\phi_k)_{k \ge 0}$ and Algorithm 1).

Let $\phi_k$ be the estimate sequence constructed above. Then, Algorithm 1 generates iterates $(x_k)_{k \ge 0}$
such that

$$F(x_k) \le \phi_k^* + \xi_k,$$

where the sequence $(\xi_k)_{k \ge 0}$ is defined by $\xi_0 = 0$ and

$$\xi_k = (1 - \alpha_{k-1}) \left( \xi_{k-1} + \varepsilon_k - (\kappa + \mu) \langle x_k - x_k^*, x_{k-1} - x_k \rangle \right).$$

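Proof. We proceed by induction. For $k = 0$, $\phi_0^* = F(x_0)$ and $\xi_0 = 0$.

Assume now that $F(x_{k-1}) \le \phi_{k-1}^* + \xi_{k-1}$. Then,

$$\begin{aligned}
\phi_{k-1}^* &\ge F(x_{k-1}) - \xi_{k-1} \\
&\ge F(x_k) + \langle \kappa (y_{k-1} - x_k), x_{k-1} - x_k \rangle + (\kappa + \mu) \langle x_k - x_k^*, x_{k-1} - x_k \rangle - \varepsilon_k - \xi_{k-1} \\
&= F(x_k) + \langle \kappa (y_{k-1} - x_k), x_{k-1} - x_k \rangle - \xi_k / (1 - \alpha_{k-1}),
\end{aligned}$$

where the second inequality is due to (25). By Lemma A.6, we now have,

$$\begin{aligned}
\phi_k^* &= (1 - \alpha_{k-1}) \phi_{k-1}^* + \alpha_{k-1} F(x_k) - \frac{\alpha_{k-1}^2}{2 \gamma_k} \|\kappa (y_{k-1} - x_k)\|^2 + \frac{\alpha_{k-1} (1 - \alpha_{k-1}) \gamma_{k-1}}{\gamma_k} \left( \frac{\mu}{2} \|x_k - v_{k-1}\|^2 + \langle \kappa (y_{k-1} - x_k), v_{k-1} - x_k \rangle \right) \\
&\ge (1 - \alpha_{k-1}) \left( F(x_k) + \langle \kappa (y_{k-1} - x_k), x_{k-1} - x_k \rangle \right) - \xi_k + \alpha_{k-1} F(x_k) - \frac{\alpha_{k-1}^2}{2 \gamma_k} \|\kappa (y_{k-1} - x_k)\|^2 + \frac{\alpha_{k-1} (1 - \alpha_{k-1}) \gamma_{k-1}}{\gamma_k} \langle \kappa (y_{k-1} - x_k), v_{k-1} - x_k \rangle \\
&= F(x_k) + (1 - \alpha_{k-1}) \left\langle \kappa (y_{k-1} - x_k), x_{k-1} - x_k + \frac{\alpha_{k-1} \gamma_{k-1}}{\gamma_k} (v_{k-1} - x_k) \right\rangle - \frac{\alpha_{k-1}^2}{2 \gamma_k} \|\kappa (y_{k-1} - x_k)\|^2 - \xi_k \\
&= F(x_k) + (1 - \alpha_{k-1}) \left\langle \kappa (y_{k-1} - x_k), x_{k-1} - y_{k-1} + \frac{\alpha_{k-1} \gamma_{k-1}}{\gamma_k} (v_{k-1} - y_{k-1}) \right\rangle + \left( 1 - \frac{(\kappa + 2\mu) \alpha_{k-1}^2}{2 \gamma_k} \right) \kappa \|y_{k-1} - x_k\|^2 - \xi_k.
\end{aligned}$$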
We now need to show that the choice of the sequences $(\alpha_k)_{k \ge 0}$ and $(y_k)_{k \ge 0}$ makes the inner-product
term vanish and keeps the coefficient of $\|y_{k-1} - x_k\|^2$ non-negative. In other words, we want to show that

$$x_{k-1} - y_{k-1} + \frac{\alpha_{k-1} \gamma_{k-1}}{\gamma_k} (v_{k-1} - y_{k-1}) = 0, \tag{26}$$

and we want to show that

$$1 - (\kappa + \mu) \frac{\alpha_{k-1}^2}{\gamma_k} = 0, \tag{27}$$

which will be sufficient to conclude that $\phi_k^* + \xi_k \ge F(x_k)$. The relation (27) can be obtained from
the definition of $\alpha_k$ in (6) and the form of $\gamma_k$ given in (20). We have indeed from (6) that

$$(\kappa + \mu) \alpha_k^2 = (1 - \alpha_k)(\kappa + \mu) \alpha_{k-1}^2 + \alpha_k \mu.$$

Then, the quantity $(\kappa + \mu) \alpha_k^2$ follows the same recursion as $\gamma_{k+1}$ in (20). Moreover, we have

$$\gamma_1 = (1 - \alpha_0) \gamma_0 + \mu \alpha_0 = (\kappa + \mu) \alpha_0^2,$$

from the definition of $\gamma_0$ in (17). We can then conclude by induction that $\gamma_{k+1} = (\kappa + \mu) \alpha_k^2$ for
all $k \ge 0$ and (27) is satisfied.

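To prove (26), we assume that $y_{k-1}$ is chosen such that (26) is satisfied, and show that it is equivalent
to defining $y_k$ as in (6). By Lemma A.6,

$$\begin{aligned}
v_k &= \frac{1}{\gamma_k} \left( (1 - \alpha_{k-1}) \gamma_{k-1} v_{k-1} + \alpha_{k-1} \mu x_k - \alpha_{k-1} \kappa (y_{k-1} - x_k) \right) \\
&= \frac{1}{\gamma_k} \left( \frac{1 - \alpha_{k-1}}{\alpha_{k-1}} \left( (\gamma_k + \alpha_{k-1} \gamma_{k-1}) y_{k-1} - \gamma_k x_{k-1} \right) + \alpha_{k-1} \mu x_k - \alpha_{k-1} \kappa (y_{k-1} - x_k) \right) \\
&= \frac{1}{\gamma_k} \left( \frac{1 - \alpha_{k-1}}{\alpha_{k-1}} \left( (\gamma_k + \alpha_{k-1} \gamma_{k-1}) y_{k-1} - \gamma_k x_{k-1} \right) + \alpha_{k-1} (\mu + \kappa) x_k - \alpha_{k-1} \kappa y_{k-1} \right) \\
&= \frac{1}{\gamma_k} \left( \frac{1}{\alpha_{k-1}} (\gamma_k - \mu \alpha_{k-1}^2) y_{k-1} - \frac{1 - \alpha_{k-1}}{\alpha_{k-1}} \gamma_k x_{k-1} + \frac{\gamma_k}{\alpha_{k-1}} x_k - \alpha_{k-1} \kappa y_{k-1} \right) \\
&= \frac{1}{\alpha_{k-1}} \left( x_k - (1 - \alpha_{k-1}) x_{k-1} \right),
\end{aligned} \tag{28}$$

where the second line uses (26), the fourth line uses (20) and (27), and the last line uses (27) again to cancel the terms in $y_{k-1}$.
As a result, using (26) by replacing $k - 1$ by $k$ yields

$$y_k = x_k + \frac{\alpha_{k-1} (1 - \alpha_{k-1})}{\alpha_{k-1}^2 + \alpha_k} (x_k - x_{k-1}),$$

and we obtain the original equivalent definition of (6). This concludes the proof.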


□ 


With this lemma in hand, we introduce the following proposition, which brings us almost to
Theorem A.3, which we want to prove.

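Proposition A.9 (Auxiliary Proposition for Theorem A.3).

Let us consider the sequence $(\lambda_k)_{k \ge 0}$ defined in (15). Then, the sequence $(x_k)_{k \ge 0}$ satisfies

$$\frac{1}{\lambda_k} \left( F(x_k) - F^* + \frac{\gamma_k}{2} \|x^* - v_k\|^2 \right) \le \phi_0(x^*) - F^* + \sum_{i=1}^{k} \frac{\varepsilon_i}{\lambda_i} + \sum_{i=1}^{k} \frac{\sqrt{2 \varepsilon_i \gamma_i}}{\lambda_i} \|x^* - v_i\|,$$

where $x^*$ is a minimizer of $F$ and $F^*$ its minimum value.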


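Proof. By the definition of the function $\phi_k$, we have

$$\begin{aligned}
\phi_k(x^*) &= (1 - \alpha_{k-1}) \phi_{k-1}(x^*) + \alpha_{k-1} \left[ F(x_k) + \langle \kappa (y_{k-1} - x_k), x^* - x_k \rangle + \frac{\mu}{2} \|x^* - x_k\|^2 \right] \\
&\le (1 - \alpha_{k-1}) \phi_{k-1}(x^*) + \alpha_{k-1} \left[ F(x^*) + \varepsilon_k - (\kappa + \mu) \langle x_k - x_k^*, x^* - x_k \rangle \right],
\end{aligned}$$

where the inequality comes from (25). Therefore, by using the definition of $\xi_k$ in Lemma A.8,

$$\begin{aligned}
\phi_k(x^*) + \xi_k - F^* &\le (1 - \alpha_{k-1}) (\phi_{k-1}(x^*) + \xi_{k-1} - F^*) + \varepsilon_k - (\kappa + \mu) \langle x_k - x_k^*, (1 - \alpha_{k-1}) x_{k-1} + \alpha_{k-1} x^* - x_k \rangle \\
&= (1 - \alpha_{k-1}) (\phi_{k-1}(x^*) + \xi_{k-1} - F^*) + \varepsilon_k - \alpha_{k-1} (\kappa + \mu) \langle x_k - x_k^*, x^* - v_k \rangle \\
&\le (1 - \alpha_{k-1}) (\phi_{k-1}(x^*) + \xi_{k-1} - F^*) + \varepsilon_k + \alpha_{k-1} (\kappa + \mu) \|x_k - x_k^*\| \|x^* - v_k\| \\
&\le (1 - \alpha_{k-1}) (\phi_{k-1}(x^*) + \xi_{k-1} - F^*) + \varepsilon_k + \alpha_{k-1} \sqrt{2 (\kappa + \mu) \varepsilon_k} \|x^* - v_k\| \\
&= (1 - \alpha_{k-1}) (\phi_{k-1}(x^*) + \xi_{k-1} - F^*) + \varepsilon_k + \sqrt{2 \varepsilon_k \gamma_k} \|x^* - v_k\|,
\end{aligned}$$

where the first equality uses the relation (28), the last inequality comes from the strong convexity
relation $\varepsilon_k \ge G_k(x_k) - G_k(x_k^*) \ge (1/2)(\kappa + \mu) \|x_k^* - x_k\|^2$, and the last equality uses the
relation $\gamma_k = (\kappa + \mu) \alpha_{k-1}^2$.

Dividing both sides by $\lambda_k$ yields

$$\frac{1}{\lambda_k} (\phi_k(x^*) + \xi_k - F^*) \le \frac{1}{\lambda_{k-1}} (\phi_{k-1}(x^*) + \xi_{k-1} - F^*) + \frac{\varepsilon_k}{\lambda_k} + \frac{\sqrt{2 \varepsilon_k \gamma_k}}{\lambda_k} \|x^* - v_k\|.$$

A simple recurrence gives,

$$\frac{1}{\lambda_k} (\phi_k(x^*) + \xi_k - F^*) \le \phi_0(x^*) - F^* + \sum_{i=1}^{k} \frac{\varepsilon_i}{\lambda_i} + \sum_{i=1}^{k} \frac{\sqrt{2 \varepsilon_i \gamma_i}}{\lambda_i} \|x^* - v_i\|.$$

Finally, by Lemmas A.6 and A.8,

$$\phi_k(x^*) + \xi_k - F^* = \frac{\gamma_k}{2} \|x^* - v_k\|^2 + \phi_k^* + \xi_k - F^* \ge \frac{\gamma_k}{2} \|x^* - v_k\|^2 + F(x_k) - F^*.$$

As a result,

$$\frac{1}{\lambda_k} \left( F(x_k) - F^* + \frac{\gamma_k}{2} \|x^* - v_k\|^2 \right) \le \phi_0(x^*) - F^* + \sum_{i=1}^{k} \frac{\varepsilon_i}{\lambda_i} + \sum_{i=1}^{k} \frac{\sqrt{2 \varepsilon_i \gamma_i}}{\lambda_i} \|x^* - v_i\|. \tag{29}$$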

□ 


To control the error term on the right and finish the proof of Theorem A.3, we are going to borrow
some methodology used to analyze the convergence of inexact proximal gradient algorithms
from [23], and use an extension of a lemma presented in [23] to bound the value of $\|v_i - x^*\|$. This
lemma is presented below.

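Lemma A.10 (Simple Lemma on Non-Negative Sequences).

Assume that the nonnegative sequences $(u_k)_{k \ge 0}$ and $(a_k)_{k \ge 0}$ satisfy the following recursion for all
$k \ge 0$:

$$u_k^2 \le S_k + \sum_{i=1}^{k} a_i u_i, \tag{30}$$

where $(S_k)_{k \ge 0}$ is an increasing sequence such that $S_0 \ge u_0^2$. Then,

$$u_k \le \frac{1}{2} \sum_{i=1}^{k} a_i + \sqrt{ \left( \frac{1}{2} \sum_{i=1}^{k} a_i \right)^2 + S_k }. \tag{31}$$

Moreover,

$$S_k + \sum_{i=1}^{k} a_i u_i \le \left( \sqrt{S_k} + \sum_{i=1}^{k} a_i \right)^2.$$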


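Proof. The first part, that is, Eq. (31), is exactly Lemma 1 from [23]. The proof is in their
appendix. Then, by calling $b_k$ the right-hand side of (31), we have that for all $k \ge 1$, $u_k \le b_k$.
Furthermore $(b_k)_{k \ge 0}$ is increasing and we have

$$S_k + \sum_{i=1}^{k} a_i u_i \le S_k + \sum_{i=1}^{k} a_i b_i \le S_k + \left( \sum_{i=1}^{k} a_i \right) b_k = b_k^2,$$

and using the inequality $\sqrt{x + y} \le \sqrt{x} + \sqrt{y}$, we have

$$b_k = \frac{1}{2} \sum_{i=1}^{k} a_i + \sqrt{ \left( \frac{1}{2} \sum_{i=1}^{k} a_i \right)^2 + S_k } \le \frac{1}{2} \sum_{i=1}^{k} a_i + \sqrt{ \left( \frac{1}{2} \sum_{i=1}^{k} a_i \right)^2 } + \sqrt{S_k} = \sqrt{S_k} + \sum_{i=1}^{k} a_i.$$

As a result,

$$S_k + \sum_{i=1}^{k} a_i u_i \le b_k^2 \le \left( \sqrt{S_k} + \sum_{i=1}^{k} a_i \right)^2.$$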

□ 
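To make Lemma A.10 concrete, the following sketch builds the extreme sequence that satisfies the recursion (30) with equality and compares it with the bound (31) and with the second inequality of the lemma. The helper names and the indexing convention (`a[0]` is unused, matching sums that start at $i = 1$) are assumptions made for this illustration only.

```python
import math

def saturating_sequence(S, a):
    """Build u_0..u_K with u_0^2 = S_0 and u_k^2 = S_k + sum_{i=1}^k a_i u_i."""
    u = [math.sqrt(S[0])]
    partial = 0.0  # running value of sum_{i<k} a_i u_i
    for k in range(1, len(S)):
        c = S[k] + partial
        # u_k is the positive root of u^2 - a_k u - c = 0
        uk = 0.5 * (a[k] + math.sqrt(a[k] ** 2 + 4.0 * c))
        u.append(uk)
        partial += a[k] * uk
    return u

def bound_31(S, a, k):
    """Right-hand side of (31)."""
    half = 0.5 * sum(a[1:k + 1])
    return half + math.sqrt(S[k] + half ** 2)

def second_bound(S, a, k):
    """(sqrt(S_k) + sum_{i=1}^k a_i)^2, the bound in the 'Moreover' part."""
    return (math.sqrt(S[k]) + sum(a[1:k + 1])) ** 2
```

Since this sequence is the largest one allowed by (30), checking it is enough to see both bounds at work on a given pair of sequences $(S_k)$ and $(a_k)$.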
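We are now in shape to conclude the proof of Theorem A.3. We apply the previous lemma to (29):

$$\frac{1}{\lambda_k} \left( \frac{\gamma_k}{2} \|x^* - v_k\|^2 + F(x_k) - F^* \right) \le \phi_0(x^*) - F^* + \sum_{i=1}^{k} \frac{\varepsilon_i}{\lambda_i} + \sum_{i=1}^{k} \frac{\sqrt{2 \varepsilon_i \gamma_i}}{\lambda_i} \|x^* - v_i\|.$$

Since $F(x_k) - F^* \ge 0$, we have

$$\underbrace{\frac{\gamma_k}{2 \lambda_k} \|x^* - v_k\|^2}_{u_k^2} \le \underbrace{\phi_0(x^*) - F^* + \sum_{i=1}^{k} \frac{\varepsilon_i}{\lambda_i}}_{S_k} + \sum_{i=1}^{k} \underbrace{2 \sqrt{\frac{\varepsilon_i}{\lambda_i}}}_{a_i} \underbrace{\sqrt{\frac{\gamma_i}{2 \lambda_i}} \|x^* - v_i\|}_{u_i},$$

with

$$u_i = \sqrt{\frac{\gamma_i}{2 \lambda_i}} \|x^* - v_i\|, \quad a_i = 2 \sqrt{\frac{\varepsilon_i}{\lambda_i}}, \quad \text{and} \quad S_k = \phi_0(x^*) - F^* + \sum_{i=1}^{k} \frac{\varepsilon_i}{\lambda_i}.$$

Then by Lemma A.10, we have

$$F(x_k) - F^* \le \lambda_k \left( S_k + \sum_{i=1}^{k} a_i u_i \right) \le \lambda_k \left( \sqrt{S_k} + \sum_{i=1}^{k} a_i \right)^2 = \lambda_k \left( \sqrt{S_k} + 2 \sum_{i=1}^{k} \sqrt{\frac{\varepsilon_i}{\lambda_i}} \right)^2,$$

which is the desired result.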


B Proofs of the Main Theorems and Propositions 

B.1 Proof of Theorem 3.1

Proof. We simply use Theorem A.3 and specialize it to the choice of parameters. The initialization
$\alpha_0 = \sqrt{q}$ leads to a particularly simple form of the algorithm, where $\alpha_k = \sqrt{q}$ for all $k \ge 0$.
Therefore, the sequence $(\lambda_k)_{k \ge 0}$ from Theorem A.3 is also simple. For all $k \ge 0$, we indeed have
$\lambda_k = (1 - \sqrt{q})^k$. To upper-bound the quantity $S_k$ from Theorem A.3, we now remark that $\gamma_0 = \mu$
and thus, by strong convexity of $F$,

$$F(x_0) + \frac{\gamma_0}{2} \|x_0 - x^*\|^2 - F^* \le 2 (F(x_0) - F^*).$$

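Therefore, writing $\eta = \sqrt{(1 - \rho)/(1 - \sqrt{q})}$ and using the choice of $\varepsilon_i$ from Theorem 3.1,

$$\begin{aligned}
\sqrt{S_k} + 2 \sum_{i=1}^{k} \sqrt{\frac{\varepsilon_i}{\lambda_i}} &= \sqrt{ F(x_0) + \frac{\gamma_0}{2} \|x_0 - x^*\|^2 - F^* + \sum_{i=1}^{k} \frac{\varepsilon_i}{\lambda_i} } + 2 \sum_{i=1}^{k} \sqrt{\frac{\varepsilon_i}{\lambda_i}} \\
&\le \sqrt{ F(x_0) + \frac{\gamma_0}{2} \|x_0 - x^*\|^2 - F^* } + 3 \sum_{i=1}^{k} \sqrt{\frac{\varepsilon_i}{\lambda_i}} \\
&\le \sqrt{2 (F(x_0) - F^*)} + 3 \sum_{i=1}^{k} \sqrt{\frac{\varepsilon_i}{\lambda_i}} \\
&= \sqrt{2 (F(x_0) - F^*)} \left[ 1 + \sum_{i=1}^{k} \eta^i \right] \\
&= \sqrt{2 (F(x_0) - F^*)} \, \frac{\eta^{k+1} - 1}{\eta - 1} \\
&\le \sqrt{2 (F(x_0) - F^*)} \, \frac{\eta^{k+1}}{\eta - 1}.
\end{aligned}$$

Therefore, Theorem A.3 combined with the previous inequality gives us

$$\begin{aligned}
F(x_k) - F^* &\le 2 \lambda_k \left( \frac{\eta^{k+1}}{\eta - 1} \right)^2 (F(x_0) - F^*) \\
&= 2 \left( \frac{\eta}{\eta - 1} \right)^2 (1 - \rho)^k (F(x_0) - F^*) \\
&= 2 \left( \frac{\sqrt{1 - \rho}}{\sqrt{1 - \rho} - \sqrt{1 - \sqrt{q}}} \right)^2 (1 - \rho)^k (F(x_0) - F^*) \\
&= 2 \left( \frac{1}{\sqrt{1 - \rho} - \sqrt{1 - \sqrt{q}}} \right)^2 (1 - \rho)^{k+1} (F(x_0) - F^*).
\end{aligned}$$

Since $\sqrt{1 - x} + \frac{x}{2}$ is decreasing in $[0, 1]$, we have $\sqrt{1 - \rho} + \frac{\rho}{2} \ge \sqrt{1 - \sqrt{q}} + \frac{\sqrt{q}}{2}$. Consequently,

$$F(x_k) - F^* \le \frac{8}{(\sqrt{q} - \rho)^2} (1 - \rho)^{k+1} (F(x_0) - F^*).$$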


B.2 Proof of Proposition 3.2


To control the number of calls of $\mathcal{M}$, we need to upper bound $G_k(x_{k-1}) - G_k^*$, which is given by
the following lemma:

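Lemma B.1 (Relation between $G_k(x_{k-1})$ and $\varepsilon_{k-1}$).

Let $(x_k)_{k \ge 0}$ and $(y_k)_{k \ge 0}$ be generated by Algorithm 1. Remember that by definition of $x_{k-1}$,

$$G_{k-1}(x_{k-1}) - G_{k-1}^* \le \varepsilon_{k-1}.$$

Then, we have

$$G_k(x_{k-1}) - G_k^* \le 2 \varepsilon_{k-1} + \frac{\kappa^2}{\kappa + \mu} \|y_{k-1} - y_{k-2}\|^2. \tag{32}$$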


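Proof. We first remark that for any $x$, $y$ in $\mathbb{R}^p$, we have

$$G_k(x) - G_{k-1}(x) = G_k(y) - G_{k-1}(y) + \kappa \langle y - x, y_{k-1} - y_{k-2} \rangle, \quad \forall k \ge 2,$$

which can be shown by using the respective definitions of $G_k$ and $G_{k-1}$ and manipulating the quadratic
term resulting from the difference $G_k(x) - G_{k-1}(x)$.

Plugging $x = x_{k-1}$ and $y = x_k^*$ in the previous relation yields

$$\begin{aligned}
G_k(x_{k-1}) - G_k^* &= G_{k-1}(x_{k-1}) - G_{k-1}(x_k^*) + \kappa \langle x_k^* - x_{k-1}, y_{k-1} - y_{k-2} \rangle \\
&= G_{k-1}(x_{k-1}) - G_{k-1}^* + G_{k-1}^* - G_{k-1}(x_k^*) + \kappa \langle x_k^* - x_{k-1}, y_{k-1} - y_{k-2} \rangle \\
&\le \varepsilon_{k-1} + G_{k-1}^* - G_{k-1}(x_k^*) + \kappa \langle x_k^* - x_{k-1}, y_{k-1} - y_{k-2} \rangle \\
&\le \varepsilon_{k-1} - \frac{\mu + \kappa}{2} \|x_k^* - x_{k-1}^*\|^2 + \kappa \langle x_k^* - x_{k-1}, y_{k-1} - y_{k-2} \rangle,
\end{aligned} \tag{33}$$

where the last inequality comes from the strong convexity inequality

$$G_{k-1}(x_k^*) \ge G_{k-1}^* + \frac{\mu + \kappa}{2} \|x_k^* - x_{k-1}^*\|^2.$$

Moreover, from the inequality $\langle x, y \rangle \le \frac{1}{2} \|x\|^2 + \frac{1}{2} \|y\|^2$, we also have

$$\kappa \langle x_k^* - x_{k-1}^*, y_{k-1} - y_{k-2} \rangle \le \frac{\mu + \kappa}{2} \|x_k^* - x_{k-1}^*\|^2 + \frac{\kappa^2}{2 (\kappa + \mu)} \|y_{k-1} - y_{k-2}\|^2, \tag{34}$$

and

$$\kappa \langle x_{k-1}^* - x_{k-1}, y_{k-1} - y_{k-2} \rangle \le \frac{\mu + \kappa}{2} \|x_{k-1}^* - x_{k-1}\|^2 + \frac{\kappa^2}{2 (\kappa + \mu)} \|y_{k-1} - y_{k-2}\|^2 \le \varepsilon_{k-1} + \frac{\kappa^2}{2 (\kappa + \mu)} \|y_{k-1} - y_{k-2}\|^2. \tag{35}$$

Summing inequalities (33), (34) and (35) gives the desired result.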

□ 
Next, we need to upper-bound the term $\|y_{k-1} - y_{k-2}\|^2$, which was also required in the convergence
proof of the accelerated SDCA algorithm [26]. We follow here their methodology.

Lemma B.2 (Control of the term $\|y_{k-1} - y_{k-2}\|^2$).

Let us consider the iterates $(x_k)_{k \ge 0}$ and $(y_k)_{k \ge 0}$ produced by Algorithm 1, and define

$$\delta_k = C (1 - \rho)^{k+1} (F(x_0) - F^*),$$

which appears in Theorem 3.1 and which is such that $F(x_k) - F^* \le \delta_k$. Then, for any $k \ge 3$,

$$\|y_{k-1} - y_{k-2}\|^2 \le \frac{72}{\mu} \delta_{k-3}.$$

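Proof. We follow here [26]. By definition of $y_k$, we have

$$\begin{aligned}
\|y_{k-1} - y_{k-2}\| &= \|x_{k-1} + \beta_{k-1} (x_{k-1} - x_{k-2}) - x_{k-2} - \beta_{k-2} (x_{k-2} - x_{k-3})\| \\
&\le (1 + \beta_{k-1}) \|x_{k-1} - x_{k-2}\| + \beta_{k-2} \|x_{k-2} - x_{k-3}\| \\
&\le 3 \max \{ \|x_{k-1} - x_{k-2}\|, \|x_{k-2} - x_{k-3}\| \},
\end{aligned}$$

where $\beta_k$ is defined in (6). The last inequality was due to the fact that $\beta_k \le 1$. Indeed, the specific
choice of $\alpha_0 = \sqrt{q}$ in Theorem 3.1 leads to $\beta_k = \frac{\sqrt{q} - q}{\sqrt{q} + q} \le 1$ for all $k$. Note, however, that this
relation $\beta_k \le 1$ is true regardless of the choice of $\alpha_0$:

$$\beta_k^2 = \frac{(\alpha_{k-1} - \alpha_{k-1}^2)^2}{(\alpha_{k-1}^2 + \alpha_k)^2} = \frac{\alpha_{k-1}^2 + \alpha_{k-1}^4 - 2 \alpha_{k-1}^3}{\alpha_k^2 + 2 \alpha_k \alpha_{k-1}^2 + \alpha_{k-1}^4} = \frac{\alpha_{k-1}^2 + \alpha_{k-1}^4 - 2 \alpha_{k-1}^3}{\alpha_{k-1}^2 + \alpha_{k-1}^4 + q \alpha_k + \alpha_k \alpha_{k-1}^2} \le 1,$$

where the last equality uses the relation $\alpha_k^2 + \alpha_k \alpha_{k-1}^2 = \alpha_{k-1}^2 + q \alpha_k$ from Algorithm 1. To conclude
the lemma, we notice that by triangle inequality

$$\|x_k - x_{k-1}\| \le \|x_k - x^*\| + \|x_{k-1} - x^*\|,$$

and by strong convexity of $F$

$$\frac{\mu}{2} \|x_k - x^*\|^2 \le F(x_k) - F(x^*) \le \delta_k.$$

As a result, since $(\delta_k)$ is decreasing,

$$\begin{aligned}
\|y_{k-1} - y_{k-2}\|^2 &\le 9 \max \{ \|x_{k-1} - x_{k-2}\|^2, \|x_{k-2} - x_{k-3}\|^2 \} \\
&\le 36 \max \{ \|x_{k-1} - x^*\|^2, \|x_{k-2} - x^*\|^2, \|x_{k-3} - x^*\|^2 \} \\
&\le \frac{72}{\mu} \delta_{k-3}. \quad □
\end{aligned}$$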
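The claim that $\beta_k \le 1$ holds regardless of $\alpha_0$ can also be probed numerically. The sketch below iterates the $\alpha_k$ recursion of Algorithm 1 and evaluates $\beta_k = \alpha_{k-1}(1 - \alpha_{k-1}) / (\alpha_{k-1}^2 + \alpha_k)$; the function name `beta_sequence` and the sampled values of $\alpha_0$ and $q$ are assumptions chosen for illustration.

```python
import math

def beta_sequence(alpha0: float, q: float, n: int) -> list:
    """Return [beta_1, ..., beta_n] for the extrapolation step of Algorithm 1."""
    betas, a_prev = [], alpha0
    for _ in range(n):
        # alpha_k is the positive root of a^2 = (1 - a) * a_prev^2 + q * a
        b = a_prev ** 2 - q
        a = 0.5 * (-b + math.sqrt(b * b + 4.0 * a_prev ** 2))
        betas.append(a_prev * (1.0 - a_prev) / (a_prev ** 2 + a))
        a_prev = a
    return betas
```

For instance, `beta_sequence(math.sqrt(q), q, n)` should return a constant sequence equal to $(\sqrt{q} - q)/(\sqrt{q} + q)$, in line with the remark in the proof.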


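We are now in shape to conclude the proof of Proposition 3.2.
By Lemma B.1 and Lemma B.2, we have for all $k \ge 3$,

$$G_k(x_{k-1}) - G_k^* \le 2 \varepsilon_{k-1} + \frac{\kappa^2}{\kappa + \mu} \frac{72}{\mu} \delta_{k-3} \le 2 \varepsilon_{k-1} + \frac{72 \kappa}{\mu} \delta_{k-3}.$$

Let $(z_t)_{t \ge 0}$ be the sequence obtained by using $\mathcal{M}$ to solve $G_k$ with initialization $z_0 = x_{k-1}$. By
assumption (8), we have

$$G_k(z_t) - G_k^* \le A (1 - \tau_{\mathcal{M}})^t (G_k(x_{k-1}) - G_k^*) \le A e^{-\tau_{\mathcal{M}} t} (G_k(x_{k-1}) - G_k^*).$$

The number of iterations $T_{\mathcal{M}}$ of $\mathcal{M}$ to guarantee an accuracy of $\varepsilon_k$ needs to satisfy

$$A e^{-\tau_{\mathcal{M}} T_{\mathcal{M}}} (G_k(x_{k-1}) - G_k^*) \le \varepsilon_k,$$

which gives

$$T_{\mathcal{M}} = \left\lceil \frac{1}{\tau_{\mathcal{M}}} \log \left( \frac{A (G_k(x_{k-1}) - G_k^*)}{\varepsilon_k} \right) \right\rceil. \tag{36}$$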
Then, it remains to upper-bound

$$\frac{G_k(x_{k-1}) - G_k^*}{\varepsilon_k} \le \frac{2 \varepsilon_{k-1} + \frac{72 \kappa}{\mu} \delta_{k-3}}{\varepsilon_k} = \frac{2}{1 - \rho} + \frac{2592 \kappa}{\mu (1 - \rho)^2 (\sqrt{q} - \rho)^2}.$$

Let us denote by $R$ the right-hand side. We remark that this upper bound holds for $k \ge 3$. We now
consider the cases $k = 1$ and $k = 2$.

When $k = 1$, $G_1(x) = F(x) + \frac{\kappa}{2} \|x - y_0\|^2$. Note that $x_0 = y_0$, then $G_1(x_0) = F(x_0)$. As a result,

$$G_1(x_0) - G_1^* = F(x_0) - F(x_1^*) - \frac{\kappa}{2} \|x_1^* - y_0\|^2 \le F(x_0) - F(x_1^*) \le F(x_0) - F^*.$$

Therefore,

$$\frac{G_1(x_0) - G_1^*}{\varepsilon_1} \le \frac{F(x_0) - F^*}{\varepsilon_1} = \frac{9}{2 (1 - \rho)} \le R.$$


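When $k = 2$, we remark that $y_1 - y_0 = (1 + \beta_1)(x_1 - x_0)$. Then, by following similar steps as in
the proof of Lemma B.2, we have

$$\|y_1 - y_0\|^2 \le 4 \|x_1 - x_0\|^2 \le \frac{32 \delta_{-1}}{\mu},$$

which is smaller than $\frac{72 \delta_{-1}}{\mu}$. Therefore, the previous steps from the case $k \ge 3$ apply and
$\frac{G_2(x_1) - G_2^*}{\varepsilon_2} \le R$. Thus, for any $k \ge 1$,

$$T_{\mathcal{M}} \le \left\lceil \frac{\log(A R)}{\tau_{\mathcal{M}}} \right\rceil, \tag{37}$$

which concludes the proof.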


B.3 Proof of Theorem 3.3

We will again use Theorem A.3 and specialize it to the choice of parameters. To apply it, the following
lemma will be useful to control the growth of $(\lambda_k)_{k \ge 0}$.

Lemma B.3 (Growth of the Sequence $(\lambda_k)_{k \ge 0}$).

Let $(\lambda_k)_{k \ge 0}$ be the sequence defined in (15) where $(\alpha_k)_{k \ge 0}$ is produced by Algorithm 1 with $\alpha_0 = \frac{\sqrt{5} - 1}{2}$
and $\mu = 0$. Then, we have the following bounds for all $k \ge 0$,

$$\frac{4}{(k + 2)^2} \ge \lambda_k \ge \frac{2}{(k + 2)^2}.$$


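Proof. Note that by definition of $\alpha_k$, we have for all $k \ge 1$,

$$\alpha_k^2 = (1 - \alpha_k) \alpha_{k-1}^2 = \prod_{i=1}^{k} (1 - \alpha_i) \, \alpha_0^2 = \lambda_{k+1} \frac{\alpha_0^2}{1 - \alpha_0} = \lambda_{k+1}.$$

With the choice of $\alpha_0$, the quantity $\gamma_0$ defined in (17) is equal to $\kappa$. By Lemma A.4, we have
$\lambda_k \le \frac{4}{(k + 2)^2}$ for all $k \ge 0$ and thus $\alpha_k \le \frac{2}{k + 3}$ for all $k \ge 1$ (it is also easy to check numerically
that this is also true for $k = 0$ since $\frac{\sqrt{5} - 1}{2} \approx 0.62 \le \frac{2}{3}$). We now have all we need to conclude the
lemma:

$$\lambda_k = \prod_{i=0}^{k-1} (1 - \alpha_i) \ge \prod_{i=0}^{k-1} \left( 1 - \frac{2}{i + 3} \right) = \frac{2}{(k + 2)(k + 1)} \ge \frac{2}{(k + 2)^2}.$$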

□ 



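With this lemma in hand, we may now proceed and apply Theorem A.3. We have remarked in the
proof of the previous lemma that $\gamma_0 = \kappa$. Then,

$$\begin{aligned}
\sqrt{S_k} + 2 \sum_{i=1}^{k} \sqrt{\frac{\varepsilon_i}{\lambda_i}} &= \sqrt{ F(x_0) - F^* + \frac{\kappa}{2} \|x_0 - x^*\|^2 + \sum_{i=1}^{k} \frac{\varepsilon_i}{\lambda_i} } + 2 \sum_{i=1}^{k} \sqrt{\frac{\varepsilon_i}{\lambda_i}} \\
&\le \sqrt{ F(x_0) - F^* + \frac{\kappa}{2} \|x_0 - x^*\|^2 } + 3 \sum_{i=1}^{k} \sqrt{\frac{\varepsilon_i}{\lambda_i}} \\
&\le \sqrt{ \frac{\kappa}{2} \|x_0 - x^*\|^2 } + \sqrt{F(x_0) - F^*} \left( 1 + \sum_{i=1}^{k} \frac{1}{(i + 2)^{1 + \eta/2}} \right),
\end{aligned}$$

where the last inequality uses Lemma B.3 to upper-bound the ratio $\varepsilon_i / \lambda_i$. Moreover,

$$\sum_{i=1}^{k} \frac{1}{(i + 2)^{1 + \eta/2}} \le \sum_{i=2}^{\infty} \frac{1}{i^{1 + \eta/2}} \le \int_{1}^{\infty} \frac{1}{x^{1 + \eta/2}} \, dx = \frac{2}{\eta}.$$

Therefore, by (16) from Theorem A.3,

$$\begin{aligned}
F(x_k) - F^* &\le \lambda_k \left( \sqrt{S_k} + 2 \sum_{i=1}^{k} \sqrt{\frac{\varepsilon_i}{\lambda_i}} \right)^2 \\
&\le \frac{4}{(k + 2)^2} \left( \sqrt{F(x_0) - F^*} \left( 1 + \frac{2}{\eta} \right) + \sqrt{ \frac{\kappa}{2} \|x_0 - x^*\|^2 } \right)^2 \\
&\le \frac{8}{(k + 2)^2} \left( \left( 1 + \frac{2}{\eta} \right)^2 (F(x_0) - F^*) + \frac{\kappa}{2} \|x_0 - x^*\|^2 \right).
\end{aligned}$$

The last inequality uses $(a + b)^2 \le 2 (a^2 + b^2)$.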


B.4 Proof of Proposition 3.4

When $\mu = 0$, we remark that Lemma B.1 still holds but Lemma B.2 does not. The main difficulty
is thus to find another way to control the quantity $\|y_{k-1} - y_{k-2}\|$.

Since $F(x_k) - F^*$ is bounded by Theorem 3.3, we may use the bounded level set assumptions to
ensure that there exists $B > 0$ such that $\|x_k - x^*\| \le B$ for any $k \ge 0$ where $x^*$ is a minimizer
of $F$. We can now follow similar steps as in the proof of Lemma B.2, and show that

$$\|y_{k-1} - y_{k-2}\|^2 \le 36 B^2.$$

Then by Lemma B.1,

$$G_k(x_{k-1}) - G_k^* \le 2 \varepsilon_{k-1} + 36 \kappa B^2.$$

Since $\kappa > 0$, $G_k$ is strongly convex, then using the same argument as in the strongly convex case,
the number of calls for $\mathcal{M}$ is given by

$$\left\lceil \frac{1}{\tau_{\mathcal{M}}} \log \left( \frac{A (G_k(x_{k-1}) - G_k^*)}{\varepsilon_k} \right) \right\rceil. \tag{38}$$

Again, we need to upper bound it

$$\frac{G_k(x_{k-1}) - G_k^*}{\varepsilon_k} \le \frac{2 \varepsilon_{k-1} + 36 \kappa B^2}{\varepsilon_k} = \frac{2 (k + 2)^{4 + \eta}}{(k + 1)^{4 + \eta}} + \frac{162 \kappa B^2 (k + 2)^{4 + \eta}}{F(x_0) - F^*}.$$

The right hand side is upper-bounded by $O((k + 2)^{4 + \eta})$. Plugging this relation into (38) gives the
desired result.
C Derivation of Global Convergence Rates

We give here a generic “template” for computing the optimal choice of $\kappa$ to accelerate a given
algorithm $\mathcal{M}$, and therefore compute the rate of convergence of the accelerated algorithm $\mathcal{A}$.

We assume here that $\mathcal{M}$ is a randomized first-order optimization algorithm, i.e. the iterates $(x_k)$
generated by $\mathcal{M}$ are a sequence of random variables; specialization to a deterministic algorithm is
straightforward. Also, for the sake of simplicity, we shall use simple notations to denote the stopping
time to reach accuracy $\varepsilon$. Definition and notation using filtrations, $\sigma$-algebras, etc. are unnecessary
for our purpose here where the quantity of interest has a clear interpretation.

Assume that algorithm $\mathcal{M}$ enjoys a linear rate of convergence, in expectation. There exist
constants $C_{\mathcal{M},F}$ and $\tau_{\mathcal{M},F}$ such that the sequence of iterates $(x_k)_{k \ge 0}$ for minimizing a strongly-convex
objective $F$ satisfies

$$\mathbb{E}[F(x_k) - F^*] \le C_{\mathcal{M},F} (1 - \tau_{\mathcal{M},F})^k. \tag{39}$$

Define the random variable $T_{\mathcal{M},F}(\varepsilon)$ (stopping time) corresponding to the minimum number of
iterations to guarantee an accuracy $\varepsilon$ in the course of running $\mathcal{M}$:

$$T_{\mathcal{M},F}(\varepsilon) := \inf \{ k \ge 1, \; F(x_k) - F^* \le \varepsilon \}. \tag{40}$$

Then, an upper bound on the expectation is provided by the following lemma.

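Lemma C.1 (Upper Bound on the expectation of $T_{\mathcal{M},F}(\varepsilon)$).

Let $\mathcal{M}$ be an optimization method with the expected rate of convergence (39). Then,

$$\mathbb{E}[T_{\mathcal{M}}(\varepsilon)] \le \frac{1}{\tau_{\mathcal{M}}} \log \left( \frac{2 C_{\mathcal{M}}}{\tau_{\mathcal{M}} \cdot \varepsilon} \right) + 1 = \tilde{O} \left( \frac{1}{\tau_{\mathcal{M}}} \log \left( \frac{C_{\mathcal{M}}}{\varepsilon} \right) \right), \tag{41}$$

where we have dropped the dependency in $F$ to simplify the notation.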


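Proof. We abbreviate $\tau_{\mathcal{M}}$ by $\tau$. Set

$$T_0 = \frac{1}{\tau} \log \left( \frac{1}{1 - e^{-\tau}} \frac{C_{\mathcal{M}}}{\varepsilon} \right).$$

For any $k \ge 0$, we have

$$\mathbb{E}[F(x_k) - F^*] \le C_{\mathcal{M}} (1 - \tau)^k \le C_{\mathcal{M}} e^{-k \tau}.$$

By Markov's inequality,

$$\mathbb{P}[F(x_k) - F^* > \varepsilon] = \mathbb{P}[T_{\mathcal{M}}(\varepsilon) > k] \le \frac{\mathbb{E}[F(x_k) - F^*]}{\varepsilon} \le \frac{C_{\mathcal{M}}}{\varepsilon} e^{-k \tau}.$$

Together with the fact that $\mathbb{P} \le 1$ and $k \ge 0$, we have

$$\mathbb{P}[T_{\mathcal{M}}(\varepsilon) \ge k + 1] \le \min \left\{ \frac{C_{\mathcal{M}}}{\varepsilon} e^{-k \tau}, 1 \right\}.$$

Therefore,

$$\begin{aligned}
\mathbb{E}[T_{\mathcal{M}}(\varepsilon)] &= \sum_{k=1}^{\infty} \mathbb{P}[T_{\mathcal{M}}(\varepsilon) \ge k] = \sum_{k=1}^{T_0} \mathbb{P}[T_{\mathcal{M}}(\varepsilon) \ge k] + \sum_{k=T_0+1}^{\infty} \mathbb{P}[T_{\mathcal{M}}(\varepsilon) \ge k] \\
&\le T_0 + \sum_{k=T_0}^{\infty} \frac{C_{\mathcal{M}}}{\varepsilon} e^{-k \tau} = T_0 + \frac{C_{\mathcal{M}}}{\varepsilon} e^{-T_0 \tau} \sum_{k=0}^{\infty} e^{-k \tau} \\
&= T_0 + \frac{C_{\mathcal{M}}}{\varepsilon} \frac{e^{-\tau T_0}}{1 - e^{-\tau}} = T_0 + 1.
\end{aligned}$$

A simple calculation shows that for any $\tau \in (0, 1)$, $\frac{\tau}{2} \le 1 - e^{-\tau}$ and then

$$\mathbb{E}[T_{\mathcal{M}}(\varepsilon)] \le T_0 + 1 = \frac{1}{\tau} \log \left( \frac{1}{1 - e^{-\tau}} \frac{C_{\mathcal{M}}}{\varepsilon} \right) + 1 \le \frac{1}{\tau} \log \left( \frac{2 C_{\mathcal{M}}}{\tau \varepsilon} \right) + 1. \tag{42}$$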

□ 
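As an illustration of Lemma C.1, the sketch below evaluates the quantity $T_0$ from the proof and the right-hand side of (41), and compares them with the stopping time (40) of a deterministic method whose gap is exactly $C_{\mathcal{M}} (1 - \tau)^k$. The function names and the worst-case model `first_hit` are assumptions introduced for this example; they are not part of the lemma.

```python
import math

def t0(C: float, tau: float, eps: float) -> float:
    """T_0 = (1/tau) * log( C / ((1 - exp(-tau)) * eps) ), as set in the proof."""
    return math.log(C / ((1.0 - math.exp(-tau)) * eps)) / tau

def expected_iterations_bound(C: float, tau: float, eps: float) -> float:
    """Right-hand side of (41): (1/tau) * log(2C / (tau * eps)) + 1."""
    return math.log(2.0 * C / (tau * eps)) / tau + 1.0

def first_hit(C: float, tau: float, eps: float) -> int:
    """Stopping time (40) when F(x_k) - F* equals C * (1 - tau)^k exactly."""
    k = 1
    while C * (1.0 - tau) ** k > eps:
        k += 1
    return k
```

In this deterministic model the stopping time is its own expectation, so it should never exceed `expected_iterations_bound` when `C >= eps`.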
Note that the previous lemma mirrors Eq. (36)-(37) in the proof of Prop. 3.2 in Appendix B. For all
optimization methods of interest, the rate $\tau_{\mathcal{M},G_k}$ is independent of $k$ and varies with the parameter $\kappa$.
We may now compute the iteration-complexity (in expectation) of the accelerated algorithm $\mathcal{A}$, that
is, for a given $\varepsilon$, the expected total number of iterations performed by the method $\mathcal{M}$. Let us now
fix $\varepsilon > 0$. Calculating the iteration-complexity decomposes into three steps:

1. Find $\kappa$ that maximizes the ratio $\tau_{\mathcal{M},G_k} / \sqrt{\mu + \kappa}$ for algorithm $\mathcal{M}$ when $F$ is $\mu$-strongly
convex. In the non-strongly convex case, we suggest maximizing instead the ratio
$\tau_{\mathcal{M},G_k} / \sqrt{L + \kappa}$. Note that the choice of $\kappa$ is less critical for non-strongly convex problems
since it only affects multiplicative constants in the global convergence rate.

2. Compute the upper-bound of the number of outer iterations $k_{\text{out}}$ using Theorem 3.1 (for the
strongly convex case), or Theorem 3.3 (for the non-strongly convex case), by replacing $\kappa$
by the optimal value found in step 1.

3. Compute the upper-bound of the expected number of inner iterations

$$\max_{k = 1, \ldots, k_{\text{out}}} \mathbb{E}[T_{\mathcal{M},G_k}(\varepsilon_k)] \le k_{\text{in}},$$

by replacing the appropriate quantities in Eq. (41) for algorithm $\mathcal{M}$; for that purpose, the
proofs of Propositions 3.2 or 3.4 may be used to upper-bound the ratio $C_{\mathcal{M},G_k} / \varepsilon_k$, or another
dedicated analysis for $\mathcal{M}$ may be required if the constant $C_{\mathcal{M},G_k}$ does not have the required
form $A (G_k(z_0) - G_k^*)$ in (8).

Then, the iteration-complexity (in expectation), denoted Comp, is given by

$$\text{Comp} \le k_{\text{in}} \times k_{\text{out}}. \tag{43}$$
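Step 1 of the template can be sketched numerically. The example below assumes, purely for illustration, that $\mathcal{M}$ is gradient descent with inner rate $\tau_{\mathcal{M},G_k} = (\mu + \kappa)/(L + \kappa)$ (the inverse condition number of $G_k$); this rate, the helper names, and the grid are assumptions and are not taken from this appendix. It then picks the $\kappa$ on the grid that maximizes $\tau_{\mathcal{M},G_k} / \sqrt{\mu + \kappa}$.

```python
import math

def gd_rate(kappa: float, L: float, mu: float) -> float:
    # Assumed inner rate of gradient descent on G_k, which is (mu + kappa)-strongly
    # convex with an (L + kappa)-Lipschitz gradient.
    return (mu + kappa) / (L + kappa)

def best_kappa(rate, L: float, mu: float, grid) -> float:
    """Grid value of kappa maximizing rate(kappa) / sqrt(mu + kappa) (step 1)."""
    return max(grid, key=lambda kappa: rate(kappa, L, mu) / math.sqrt(mu + kappa))
```

Under this assumed rate the objective is $\sqrt{\mu + \kappa} / (L + \kappa)$, whose derivative vanishes at $\kappa = L - 2\mu$, so a sufficiently fine grid should return a value close to $L - 2\mu$.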


D A Proximal MISO/Finito Algorithm 

In this section, we present the algorithm MISO/Finito, and show how to extend it in two ways. First,
we propose a proximal version to deal with composite optimization problems, and we analyze its
rate of convergence. Second, we show how to remove a large sample condition $n \ge 2L/\mu$, which
was necessary for the convergence of the algorithm. The resulting algorithm is a variant of proximal
SDCA [25] with a different stepsize and a stopping criterion that does not use duality.


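D.1 The Original Algorithm MISO/Finito


MISO/Finito was proposed in [14] and [7] for solving the following smooth unconstrained convex
minimization problem

$$\min_{x \in \mathbb{R}^p} \left\{ f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x) \right\}, \tag{44}$$

where each $f_i$ is differentiable with $L$-Lipschitz continuous derivatives and $\mu$-strongly convex. At
iteration $k$, the algorithm updates a list of lower bounds $d_i^k$ of the functions $f_i$, by randomly picking
up one index $i_k$ among $\{1, \ldots, n\}$ and performing the following update

$$d_i^k(x) = \begin{cases} f_i(x_{k-1}) + \langle \nabla f_i(x_{k-1}), x - x_{k-1} \rangle + \frac{\mu}{2} \|x - x_{k-1}\|^2 & \text{if } i = i_k \\ d_i^{k-1}(x) & \text{otherwise,} \end{cases}$$

which is a lower bound of $f_i$ because of the $\mu$-strong convexity of $f_i$. Equivalently, one may perform
the following updates

$$z_i^k = \begin{cases} x_{k-1} - \frac{1}{\mu} \nabla f_i(x_{k-1}) & \text{if } i = i_k \\ z_i^{k-1} & \text{otherwise,} \end{cases}$$

and all functions $d_i^k$ have the form

$$d_i^k(x) = c_i^k + \frac{\mu}{2} \|x - z_i^k\|^2,$$

where $c_i^k$ is a constant. Then, MISO/Finito performs the following minimization to produce the
iterate $(x_k)$:

$$x_k = \arg\min_{x \in \mathbb{R}^p} \frac{1}{n} \sum_{i=1}^{n} d_i^k(x) = \frac{1}{n} \sum_{i=1}^{n} z_i^k,$$

which is equivalent to

$$x_k \leftarrow x_{k-1} - \frac{1}{n} \left( z_{i_k}^{k-1} - z_{i_k}^{k} \right).$$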

In many machine learning problems, it is worth remarking that each function $f_i(x)$ has the
specific form $f_i(x) = l_i(\langle x, w_i \rangle) + \frac{\mu}{2} \|x\|^2$. In such cases, the vectors $z_i^k$ can be obtained by storing
only $O(n)$ scalars.¹ The main convergence result of [14] is that the procedure above converges with
a linear rate of convergence of the form (8), with $\tau_{\text{MISO}} = 1/3n$ (also refined in $1/2n$ in [7]), when
the large sample size constraint $n \ge 2L/\mu$ is satisfied.

Removing this condition and extending MISO to the composite optimization problem (1) is the
purpose of the next section.

D.2 Proximal MISO 

We now consider the composite optimization problem below,

$$\min_{x \in \mathbb{R}^p} \left\{ F(x) = \frac{1}{n} \sum_{i=1}^{n} f_i(x) + \psi(x) \right\},$$

where the functions $f_i$ are differentiable with $L$-Lipschitz derivatives and $\mu$-strongly convex. As in
typical composite optimization problems, $\psi$ is convex but not necessarily differentiable. We assume
that the proximal operator of $\psi$ can be computed easily. The algorithm needs to be initialized with
some lower bounds for the functions $f_i$:

$$f_i(x) \ge \frac{\mu}{2} \|x - z_i^0\|^2 + c_i^0, \tag{A1}$$

which are guaranteed to exist due to the $\mu$-strong convexity of $f_i$. For typical machine learning
applications, such initialization is easy. For example, logistic regression with $\ell_2$-regularization
satisfies (A1) with $z_i^0 = 0$ and $c_i^0 = 0$. Then, the MISO-Prox scheme is given in Algorithm 2. Note
that if no simple initialization is available, we may consider any initial estimate $\bar{z}_0$ in $\mathbb{R}^p$ and define
$z_i^0 = \bar{z}_0 - (1/\mu) \nabla f_i(\bar{z}_0)$, which requires performing one pass over the data.

Then, we remark that under the large sample size condition $n \ge 2L/\mu$, we have $\delta = 1$ and the
update of the quantities $z_i^k$ in (45) is the same as in the original MISO/Finito algorithm. As we will
see in the convergence analysis, the choice of $\delta$ ensures convergence of the algorithm even in the
small sample size regime $n < 2L/\mu$.

Relation with Proximal SDCA [25]. The algorithm MISO-Prox is almost identical to variant 5
of proximal SDCA [25], which performs the same updates with $\delta = \mu n / (L + \mu n)$ instead of $\delta =
\min \left( 1, \frac{\mu n}{2 (L - \mu)} \right)$. It is however not clear that MISO-Prox actually performs dual ascent steps in the
sense of SDCA since the proof of convergence of SDCA cannot be directly modified to use the
stepsize of proximal MISO and furthermore, the convergence proof of MISO-Prox does not use the
concept of duality. Another difference lies in the optimality certificate of the algorithms. Whereas
Proximal-SDCA provides a certificate in terms of linear convergence of a duality gap based on
Fenchel duality, Proximal-MISO ensures linear convergence of a gap that relies on strong convexity
but not on the Fenchel dual (at least explicitly).

Optimality Certificate and Stopping Criterion. Similar to the original MISO algorithm,
Proximal MISO maintains a list $(d_i^k)$ of lower bounds of the functions $f_i$, which are updated in the
following fashion

$$d_i^k(x) = \begin{cases} (1 - \delta) d_i^{k-1}(x) + \delta \left( f_i(x_{k-1}) + \langle \nabla f_i(x_{k-1}), x - x_{k-1} \rangle + \frac{\mu}{2} \|x - x_{k-1}\|^2 \right) & \text{if } i = i_k \\ d_i^{k-1}(x) & \text{otherwise.} \end{cases} \tag{46}$$


¹Note that even though we call this algorithm MISO (or Finito), it was called MISO$\mu$ in [14], whereas
“MISO” was originally referring to an incremental majorization-minimization procedure that uses upper bounds
of the functions $f_i$ instead of lower bounds, which is appropriate for non-convex optimization problems.
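Algorithm 2 MISO-Prox: an improved MISO algorithm with proximal support.
input $(z_i^0)_{i=1,\ldots,n}$ such that (A1) holds; $N$ (number of iterations);

1: initialize $\bar{z}_0 = \frac{1}{n} \sum_{i=1}^{n} z_i^0$ and $x_0 = \text{prox}_{\psi/\mu}[\bar{z}_0]$;

2: define $\delta = \min \left( 1, \frac{\mu n}{2 (L - \mu)} \right)$;

3: for $k = 1, \ldots, N$ do

4: randomly pick up an index $i_k$ in $\{1, \ldots, n\}$;

5: update

$$\begin{aligned}
z_i^k &= \begin{cases} (1 - \delta) z_i^{k-1} + \delta \left( x_{k-1} - \frac{1}{\mu} \nabla f_i(x_{k-1}) \right) & \text{if } i = i_k \\ z_i^{k-1} & \text{otherwise} \end{cases} \\
\bar{z}_k &= \bar{z}_{k-1} - \frac{1}{n} \left( z_{i_k}^{k-1} - z_{i_k}^{k} \right) = \frac{1}{n} \sum_{i=1}^{n} z_i^k \\
x_k &= \text{prox}_{\psi/\mu}[\bar{z}_k].
\end{aligned} \tag{45}$$

6: end for

output $x_N$ (final estimate).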
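The updates (45) can be sketched on a toy problem. The code below is a minimal scalar illustration, assuming $f_i(x) = \frac{1}{2} a_i (x - b_i)^2$ with $\mu \le a_i \le L$ and $\psi(x) = \lambda |x|$, so that the proximal operator is a soft-thresholding; the problem instance, the function names, and the use of the generic initialization $z_i^0 = \bar{z}_0 - (1/\mu) \nabla f_i(\bar{z}_0)$ with $\bar{z}_0 = 0$ are assumptions made for this example, not the experimental setup of the paper.

```python
import random

def soft_threshold(v: float, t: float) -> float:
    """Proximal operator of t * |.| in the scalar case."""
    return max(abs(v) - t, 0.0) * (1.0 if v > 0 else -1.0)

def miso_prox(a, b, lam, mu, L, num_iters, seed=0):
    """Scalar MISO-Prox for F(x) = (1/n) sum_i 0.5*a_i*(x - b_i)^2 + lam*|x|.

    Assumes mu <= a_i <= L for all i and L > mu.
    """
    rng = random.Random(seed)
    n = len(a)
    grad = lambda i, x: a[i] * (x - b[i])
    # Generic initialization from the point 0: z_i^0 = 0 - grad f_i(0) / mu.
    z = [-grad(i, 0.0) / mu for i in range(n)]
    z_bar = sum(z) / n
    x = soft_threshold(z_bar, lam / mu)           # x_0 = prox_{psi/mu}[z_bar_0]
    delta = min(1.0, mu * n / (2.0 * (L - mu)))   # step 2 of Algorithm 2
    for _ in range(num_iters):
        i = rng.randrange(n)                      # random index i_k
        z_new = (1.0 - delta) * z[i] + delta * (x - grad(i, x) / mu)
        z_bar += (z_new - z[i]) / n               # running average of the z_i
        z[i] = z_new
        x = soft_threshold(z_bar, lam / mu)       # x_k = prox_{psi/mu}[z_bar_k]
    return x

def exact_solution(a, b, lam):
    """Closed-form minimizer of the scalar toy problem, used only for checking."""
    n = len(a)
    a_bar = sum(a) / n
    c = sum(ai * bi for ai, bi in zip(a, b)) / n
    return soft_threshold(c, lam) / a_bar
```

On such a toy instance, `miso_prox(a, b, lam, mu, L, num_iters)` can be compared directly with `exact_solution(a, b, lam)`; with $n = 5$ and $L/\mu = 10$ the run is in the small sample size regime where $\delta < 1$.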


Then, the following function is a lower bound of the objective $F$:

$$D_k(x) = \frac{1}{n} \sum_{i=1}^{n} d_i^k(x) + \psi(x), \tag{47}$$

and the update (45) can be shown to exactly minimize $D_k$. As a lower bound of $F$, we have
that $D_k(x_k) \le F^*$ and thus

$$F(x_k) - F^* \le F(x_k) - D_k(x_k).$$

The quantity $F(x_k) - D_k(x_k)$ can then be interpreted as an optimality gap, and the analysis below
will show that it converges linearly to zero. In practice, it also provides a convenient stopping
criterion, which yields Algorithm 3.


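Algorithm 3 MISO-Prox with stopping criterion.

input $(z_i^0, c_i^0)_{i=1,\ldots,n}$ such that (A1) holds; $\varepsilon$ (target accuracy);

1: initialize $\bar{z}_0 = \frac{1}{n} \sum_{i=1}^{n} z_i^0$ and $c_i'^0 = c_i^0 + \frac{\mu}{2} \|z_i^0\|^2$ for all $i$ in $\{1, \ldots, n\}$ and $x_0 = \text{prox}_{\psi/\mu}[\bar{z}_0]$;

2: define $\delta = \min \left( 1, \frac{\mu n}{2 (L - \mu)} \right)$ and $k = 0$;

3: while $\frac{1}{n} \sum_{i=1}^{n} \left( f_i(x_k) - c_i'^k \right) + \mu \langle \bar{z}_k, x_k \rangle - \frac{\mu}{2} \|x_k\|^2 > \varepsilon$ do

4: for $l = 1, \ldots, n$ do

5: $k \leftarrow k + 1$;

6: randomly pick up an index $i_k$ in $\{1, \ldots, n\}$;

7: perform the update (45);

8: update

$$c_i'^k = \begin{cases} (1 - \delta) c_i'^{k-1} + \delta \left( f_i(x_{k-1}) - \langle \nabla f_i(x_{k-1}), x_{k-1} \rangle + \frac{\mu}{2} \|x_{k-1}\|^2 \right) & \text{if } i = i_k \\ c_i'^{k-1} & \text{otherwise} \end{cases} \tag{48}$$

9: end for

10: end while

output $x_N$ (final estimate such that $F(x_N) - F^* \le \varepsilon$).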


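To explain the stopping criterion in Algorithm 3, we remark that the functions $d_i^k$ are quadratic and
can be written

$$d_i^k(x) = c_i'^k - \mu \langle x, z_i^k \rangle + \frac{\mu}{2} \|x\|^2, \tag{49}$$

where the $c_i'^k$'s are some constants and $c_i'^0 = c_i^0 + \frac{\mu}{2} \|z_i^0\|^2$. Equation (48) shows how to update
recursively these constants $c_i'^k$, and finally

$$D_k(x_k) = \left( \frac{1}{n} \sum_{i=1}^{n} c_i'^k \right) - \mu \langle x_k, \bar{z}_k \rangle + \frac{\mu}{2} \|x_k\|^2 + \psi(x_k),$$

and

$$F(x_k) - D_k(x_k) = \left( \frac{1}{n} \sum_{i=1}^{n} f_i(x_k) - c_i'^k \right) + \mu \langle x_k, \bar{z}_k \rangle - \frac{\mu}{2} \|x_k\|^2,$$

which justifies the stopping criterion. Since computing $F(x_k)$ requires scanning all the data points,
the criterion is only computed every $n$ iterations.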


Convergence Analysis. The convergence of MISO-Prox is guaranteed by Theorem 4.1 from the
main part of the paper. Before we prove this theorem, we note that this rate is slightly better than the
one proven in MISO [14], which converges as $(1 - \frac{1}{3n})^k$. We start by recalling a classical lemma
that provides useful inequalities. Its proof may be found in [17].

Lemma D.1 (Classical Quadratic Upper and Lower Bounds).

For any function $g : \mathbb{R}^p \to \mathbb{R}$ which is $\mu$-strongly convex and differentiable with $L$-Lipschitz
derivatives, we have for all $x$, $y$ in $\mathbb{R}^p$,

$$\frac{\mu}{2} \|x - y\|^2 \le g(x) - g(y) - \langle \nabla g(y), x - y \rangle \le \frac{L}{2} \|x - y\|^2.$$

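To start the proof, we need a sequence of upper and lower bounds involving the functions $D_k$ and
$D_{k-1}$. The first one is given in the next lemma.

Lemma D.2 (Lower Bound on $D_k$).

For all $k \ge 1$ and $x$ in $\mathbb{R}^p$,

$$D_k(x) \ge D_{k-1}(x) - \frac{\delta(L-\mu)}{2n}\|x - x_{k-1}\|^2. \tag{50}$$

Proof. For any $i \in \{1,\ldots,n\}$, $f_i$ satisfies the assumptions of Lemma D.1 and we have for all
$k \ge 1$, $x$ in $\mathbb{R}^p$, and for $i = i_k$,

$$\begin{aligned}
d_i^k(x) &= (1-\delta)\,d_i^{k-1}(x) + \delta\left[f_i(x_{k-1}) + \langle \nabla f_i(x_{k-1}), x - x_{k-1}\rangle + \frac{\mu}{2}\|x - x_{k-1}\|^2\right] \\
&\ge (1-\delta)\,d_i^{k-1}(x) + \delta f_i(x) - \frac{\delta(L-\mu)}{2}\|x - x_{k-1}\|^2 \\
&\ge d_i^{k-1}(x) - \frac{\delta(L-\mu)}{2}\|x - x_{k-1}\|^2,
\end{aligned}$$

where the definition of $d_i^k$ is given in (46). The first inequality uses Lemma D.1 and the last
one uses the inequality $f_i \ge d_i^{k-1}$. From this inequality, we can obtain (50) by simply using

$$D_k(x) = \frac{1}{n}\sum_{i=1}^n d_i^k(x) + \psi(x) = D_{k-1}(x) + \frac{1}{n}\left(d_{i_k}^k(x) - d_{i_k}^{k-1}(x)\right). \qquad \square$$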

Next, we prove the following lemma to compare $D_k$ and $D_{k-1}$.

Lemma D.3 (Relation between $D_k$ and $D_{k-1}$).

For all $k \ge 1$, for all $x$ and $y$ in $\mathbb{R}^p$,

$$D_k(x) - D_k(y) = D_{k-1}(x) - D_{k-1}(y) - \mu\langle z_k - z_{k-1}, x - y\rangle. \tag{51}$$

Proof. Remember that the functions $d_i^k$ are quadratic and have the form (49), that $D_k$ is defined
in (47), and that $z_k$ minimizes $\frac{1}{n}\sum_{i=1}^n d_i^k$. Then, there exists a constant $A_k$ such that

$$D_k(x) = A_k + \frac{\mu}{2}\|x - z_k\|^2 + \psi(x).$$

This gives

$$D_k(x) - D_k(y) = \frac{\mu}{2}\|x - z_k\|^2 - \frac{\mu}{2}\|y - z_k\|^2 + \psi(x) - \psi(y). \tag{52}$$

Similarly,

$$D_{k-1}(x) - D_{k-1}(y) = \frac{\mu}{2}\|x - z_{k-1}\|^2 - \frac{\mu}{2}\|y - z_{k-1}\|^2 + \psi(x) - \psi(y). \tag{53}$$

Subtracting (53) from (52) gives (51). $\square$


Then, we are able to control the value of $D_k(x_{k-1})$ in the next lemma.

Lemma D.4 (Controlling the value $D_k(x_{k-1})$).

For any $k \ge 1$,

$$D_k(x_{k-1}) - D_k(x_k) \le \frac{\mu}{2}\|z_k - z_{k-1}\|^2. \tag{54}$$

Proof. Using Lemma D.3 with $x = x_{k-1}$ and $y = x_k$ yields

$$D_k(x_{k-1}) - D_k(x_k) = D_{k-1}(x_{k-1}) - D_{k-1}(x_k) - \mu\langle z_k - z_{k-1}, x_{k-1} - x_k\rangle.$$

Moreover, $x_{k-1}$ is the minimum of $D_{k-1}$, which is $\mu$-strongly convex. Thus,

$$D_{k-1}(x_{k-1}) + \frac{\mu}{2}\|x_k - x_{k-1}\|^2 \le D_{k-1}(x_k).$$

Adding the two previous relations gives the first inequality below

$$D_k(x_{k-1}) - D_k(x_k) \le -\frac{\mu}{2}\|x_k - x_{k-1}\|^2 - \mu\langle z_k - z_{k-1}, x_{k-1} - x_k\rangle \le \frac{\mu}{2}\|z_k - z_{k-1}\|^2,$$

and the last one comes from the basic inequality $\frac{1}{2}\|a\|^2 + \langle a, b\rangle + \frac{1}{2}\|b\|^2 \ge 0$. $\square$
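
To make the last step of this proof explicit, here is a short worked version. The shorthand $u = x_k - x_{k-1}$ and $v = z_k - z_{k-1}$ is introduced only for this remark. Since $x_{k-1} - x_k = -u$, the middle expression of the chain equals $-\frac{\mu}{2}\|u\|^2 + \mu\langle v, u\rangle$, and its difference with the right-hand side is a perfect square:

```latex
\begin{align*}
\frac{\mu}{2}\|v\|^2 - \Big(-\frac{\mu}{2}\|u\|^2 + \mu\langle v, u\rangle\Big)
  &= \frac{\mu}{2}\Big(\|u\|^2 - 2\langle u, v\rangle + \|v\|^2\Big) \\
  &= \frac{\mu}{2}\|u - v\|^2 \;\ge\; 0 .
\end{align*}
```

This is the basic inequality applied with $a = \sqrt{\mu}\,u$ and $b = -\sqrt{\mu}\,v$.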

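We now have all the inequalities in hand to prove Theorem 4.1.


Proof of Theorem 4.1.

We start by giving a lower bound of $D_k(x_{k-1}) - D_{k-1}(x_{k-1})$.

Take $x = x_{k-1}$ in (51). Then, for all $y$ in $\mathbb{R}^p$,

$$\begin{aligned}
D_k(x_{k-1}) - D_{k-1}(x_{k-1}) &= D_k(y) - D_{k-1}(y) + \mu\langle z_k - z_{k-1}, y - x_{k-1}\rangle \\
&\ge -\frac{\delta(L-\mu)}{2n}\|y - x_{k-1}\|^2 + \mu\langle z_k - z_{k-1}, y - x_{k-1}\rangle \qquad \text{by (50)}.
\end{aligned}$$

Choose $y$ that maximizes the above quadratic function, i.e.

$$y = x_{k-1} + \frac{n\mu}{\delta(L-\mu)}(z_k - z_{k-1}),$$

and then

$$\begin{aligned}
D_k(x_{k-1}) - D_{k-1}(x_{k-1}) &\ge \frac{n\mu^2}{2\delta(L-\mu)}\|z_k - z_{k-1}\|^2 \\
&\ge \frac{n\mu}{\delta(L-\mu)}\left[D_k(x_{k-1}) - D_k(x_k)\right] \qquad \text{by (54)}.
\end{aligned} \tag{55}$$

Then, we start introducing expected values.

By construction,

$$D_k(x_{k-1}) = D_{k-1}(x_{k-1}) + \frac{\delta}{n}\left(f_{i_k}(x_{k-1}) - d_{i_k}^{k-1}(x_{k-1})\right).$$

After taking expectation, we obtain the relation

$$\mathbb{E}[D_k(x_{k-1})] = \left(1 - \frac{\delta}{n}\right)\mathbb{E}[D_{k-1}(x_{k-1})] + \frac{\delta}{n}\,\mathbb{E}[F(x_{k-1})]. \tag{56}$$
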

We now introduce an important quantity

$$\tau = \left(1 - \frac{\delta(L-\mu)}{n\mu}\right)\frac{\delta}{n},$$

and combine (55) with (56) to obtain

$$\tau\,\mathbb{E}[F(x_{k-1})] - \mathbb{E}[D_k(x_k)] \le -(1-\tau)\,\mathbb{E}[D_{k-1}(x_{k-1})].$$
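
For completeness, here is one way to carry out this combination. The shorthand $c = n\mu/(\delta(L-\mu))$ is introduced here only for brevity, so that $\tau = (1 - 1/c)\,\delta/n$. Dividing (55) by $c > 0$ and rearranging gives the first line; substituting the equality (56) into its expectation gives the rest.

```latex
\begin{align*}
D_k(x_k) &\ge \Big(1 - \tfrac{1}{c}\Big) D_k(x_{k-1}) + \tfrac{1}{c}\, D_{k-1}(x_{k-1}), \\
\mathbb{E}[D_k(x_k)]
  &\ge \Big(1 - \tfrac{1}{c}\Big)\Big[\Big(1 - \tfrac{\delta}{n}\Big)\mathbb{E}[D_{k-1}(x_{k-1})]
       + \tfrac{\delta}{n}\,\mathbb{E}[F(x_{k-1})]\Big] + \tfrac{1}{c}\,\mathbb{E}[D_{k-1}(x_{k-1})] \\
  &= (1 - \tau)\,\mathbb{E}[D_{k-1}(x_{k-1})] + \tau\,\mathbb{E}[F(x_{k-1})],
\end{align*}
```

where the last line uses $(1 - 1/c)(1 - \delta/n) + 1/c = 1 - (1 - 1/c)\,\delta/n = 1 - \tau$.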

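We reformulate this relation as

$$\tau\left(\mathbb{E}[F(x_{k-1})] - F^*\right) + \left(F^* - \mathbb{E}[D_k(x_k)]\right) \le (1-\tau)\left(F^* - \mathbb{E}[D_{k-1}(x_{k-1})]\right). \tag{57}$$

On the one hand, since $F(x_{k-1}) \ge F^*$, we have

$$F^* - \mathbb{E}[D_k(x_k)] \le (1-\tau)\left(F^* - \mathbb{E}[D_{k-1}(x_{k-1})]\right).$$

This is true for any $k \ge 1$; as a result,

$$F^* - \mathbb{E}[D_k(x_k)] \le (1-\tau)^k\left(F^* - D_0(x_0)\right). \tag{58}$$

On the other hand, since $F^* \ge D_k(x_k)$, then

$$\tau\left(\mathbb{E}[F(x_{k-1})] - F^*\right) \le (1-\tau)\left(F^* - \mathbb{E}[D_{k-1}(x_{k-1})]\right) \le (1-\tau)^k\left(F^* - D_0(x_0)\right),$$

which gives us the convergence relation of the theorem. We conclude by giving the choice of $\delta$. We choose it to
maximize the rate of convergence, which amounts to maximizing $\tau$. This is a quadratic function of $\delta$, which
is maximized at $\delta = \frac{n\mu}{2(L-\mu)}$. However, by definition $\delta \le 1$. Therefore, the optimal choice of $\delta$ is
given by

$$\delta = \min\left\{1, \frac{n\mu}{2(L-\mu)}\right\}.$$

Note now that

1. When $\frac{n\mu}{2(L-\mu)} \le 1$, we have $\delta = \frac{n\mu}{2(L-\mu)}$ and $\tau = \frac{\mu}{4(L-\mu)}$.

2. When $1 \le \frac{n\mu}{2(L-\mu)}$, we have $\delta = 1$ and $\tau = \frac{1}{n} - \frac{L-\mu}{n^2\mu} \ge \frac{1}{2n}$.

Therefore, $\tau \ge \min\left\{\frac{1}{2n}, \frac{\mu}{4(L-\mu)}\right\}$, which concludes the first part of the theorem.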
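
The choice of $\delta$ and the resulting lower bound on $\tau$ are easy to check numerically. The following sketch (function names are hypothetical, not from the paper) evaluates $\tau = \left(1 - \frac{\delta(L-\mu)}{n\mu}\right)\frac{\delta}{n}$ at $\delta = \min\{1, n\mu/(2(L-\mu))\}$ and compares it with $\min\{1/(2n), \mu/(4(L-\mu))\}$ for a few illustrative values of $(n, \mu, L)$.

```python
def tau(delta, n, mu, L):
    """Rate tau = (1 - delta * (L - mu) / (n * mu)) * delta / n."""
    return (1.0 - delta * (L - mu) / (n * mu)) * delta / n


def best_delta(n, mu, L):
    """Maximizer of tau over (0, 1]: min(1, n * mu / (2 * (L - mu)))."""
    return min(1.0, n * mu / (2.0 * (L - mu)))


def tau_lower_bound(n, mu, L):
    """Lower bound min(1 / (2n), mu / (4 * (L - mu))) on the optimal tau."""
    return min(1.0 / (2 * n), mu / (4.0 * (L - mu)))


if __name__ == "__main__":
    # Illustrative problem sizes and constants (not taken from the experiments).
    for n, mu, L in [(10, 0.01, 1.0), (1000, 0.01, 1.0), (50, 0.5, 2.0)]:
        d = best_delta(n, mu, L)
        print(n, mu, L, d, tau(d, n, mu, L), tau_lower_bound(n, mu, L))
```

In the first setting the unconstrained maximizer is below 1 and the bound is attained; in the other two, $\delta = 1$ and $\tau$ is strictly larger than $1/(2n)$.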

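To prove the second part, we use (58) together with the bound on $\mathbb{E}[F(x_k)] - F^*$ obtained above, which gives

$$\begin{aligned}
\mathbb{E}[F(x_k) - D_k(x_k)] &= \mathbb{E}[F(x_k)] - F^* + F^* - \mathbb{E}[D_k(x_k)] \\
&\le \frac{1}{\tau}(1-\tau)^{k+1}\left(F^* - D_0(x_0)\right) + (1-\tau)^k\left(F^* - D_0(x_0)\right) \\
&= \frac{1}{\tau}(1-\tau)^k\left(F^* - D_0(x_0)\right). \qquad \square
\end{aligned}$$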


D.3 Accelerating MISO-Prox 

The convergence rate of MISO (and also of SDCA) requires special handling since it does not exactly
satisfy the condition from Proposition 3.2. The rate of convergence is linear, but with a constant
proportional to $F^* - D_0(x_0)$ instead of $F(x_0) - F^*$ for many classical gradient-based approaches.

To achieve acceleration, we show in this section how to obtain guarantees similar to those of
Propositions 3.2 and 3.4, that is, how to solve the subproblems (5) efficiently. This essentially requires the right
initialization each time MISO-Prox is called. By initialization, we mean initializing the variables $z_i^0$.

Assume that MISO-Prox is used to obtain $x_{k-1}$ from Algorithm 1 with $G_{k-1}(x_{k-1}) - G_{k-1}^* \le \varepsilon_{k-1}$,
and that one wishes to use MISO-Prox again on $G_k$ to compute $x_k$. Then, let us call $D'$ the lower
bound of $G_{k-1}$ produced by MISO-Prox when computing $x_{k-1}$ such that

$$x_{k-1} = \arg\min_{x \in \mathbb{R}^p}\left\{D'(x) = \frac{1}{n}\sum_{i=1}^n d_i'(x) + \psi(x)\right\},$$

with

$$d_i'(x) = \frac{\mu+\kappa}{2}\|x - z_i'\|^2 + c_i'.$$

Note that we do not index these quantities with $k-1$ or $k$ for the sake of simplicity. The convergence
of MISO-Prox may ensure that not only do we have $G_{k-1}(x_{k-1}) - G_{k-1}^* \le \varepsilon_{k-1}$, but in fact we have
the stronger condition $G_{k-1}(x_{k-1}) - D'(x_{k-1}) \le \varepsilon_{k-1}$. Remember now that

$$G_k(x) = G_{k-1}(x) + \frac{\kappa}{2}\|x - y_{k-1}\|^2 - \frac{\kappa}{2}\|x - y_{k-2}\|^2,$$

and that $D'$ is a lower bound of $G_{k-1}$. Then, we may set for all $i$ in $\{1,\ldots,n\}$

$$d_i^0(x) = d_i'(x) + \frac{\kappa}{2}\|x - y_{k-1}\|^2 - \frac{\kappa}{2}\|x - y_{k-2}\|^2,$$

which is equivalent to initializing the new instance of MISO-Prox with

$$z_i^0 = z_i' + \frac{\kappa}{\kappa+\mu}(y_{k-1} - y_{k-2}),$$

and by choosing appropriate quantities $c_i^0$. Then, the following function is a lower bound of $G_k$:

$$D_0(x) = \frac{1}{n}\sum_{i=1}^n d_i^0(x) + \psi(x),$$

and the new instance of MISO-Prox to minimize $G_k$ and compute $x_k$ will produce iterates whose
first point, which we call $x^0$, minimizes $D_0$. This leads to the relation

$$x^0 = \mathrm{prox}_{\psi/(\kappa+\mu)}\left[z^0\right] = \mathrm{prox}_{\psi/(\kappa+\mu)}\left[z' + \frac{\kappa}{\kappa+\mu}(y_{k-1} - y_{k-2})\right],$$

where we use the notation $z^0 = \frac{1}{n}\sum_{i=1}^n z_i^0$ and $z' = \frac{1}{n}\sum_{i=1}^n z_i'$ as in Algorithm 2.
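
The warm start described above amounts to shifting every center by the same vector and applying one proximal step. The sketch below illustrates this in plain Python. It makes assumptions that go beyond the text: vectors are stored as lists of floats, $\psi = \lambda\|\cdot\|_1$ is chosen purely for illustration (so the proximal operator is soft-thresholding), and the function names are hypothetical.

```python
def warm_start_centers(z_prev, y_new, y_old, mu, kappa):
    """Shift every center z'_i by kappa / (kappa + mu) * (y_{k-1} - y_{k-2})."""
    shift = [kappa / (kappa + mu) * (u - v) for u, v in zip(y_new, y_old)]
    return [[zj + sj for zj, sj in zip(z, shift)] for z in z_prev]


def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (illustrative choice of psi)."""
    return [max(abs(vj) - t, 0.0) * (1.0 if vj > 0 else -1.0) for vj in v]


def first_iterate(z_prev, y_new, y_old, mu, kappa, lam):
    """x^0 = prox_{psi / (kappa + mu)}[z^0], with z^0 the average of the shifted centers."""
    z0 = warm_start_centers(z_prev, y_new, y_old, mu, kappa)
    n = len(z0)
    zbar = [sum(col) / n for col in zip(*z0)]
    return soft_threshold(zbar, lam / (kappa + mu))
```

Because the shift is the same for all $i$, only the averaged center needs to be updated in practice; the per-index loop is kept here to mirror the formula for $z_i^0$.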

Then, it remains to show that the quantity $G_k^* - D_0(x^0)$ is upper bounded in a similar fashion as
$G_k(x_{k-1}) - G_k^*$ in Propositions 3.2 and 3.4 to obtain a similar result for MISO-Prox and control the
number of inner iterations. This is indeed the case, as stated in the next lemma.

Lemma D.5 (Controlling $G_k(x_{k-1}) - G_k^*$ for MISO-Prox).

When initializing MISO-Prox as described above, we have

$$G_k^* - D_0(x^0) \le \varepsilon_{k-1} + \frac{\kappa^2}{2(\kappa+\mu)}\|y_{k-1} - y_{k-2}\|^2.$$


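Proof. By strong convexity, we have

$$D_0(x^0) + \frac{\kappa}{2}\|x^0 - y_{k-2}\|^2 - \frac{\kappa}{2}\|x^0 - y_{k-1}\|^2 = D'(x^0) \ge D'(x_{k-1}) + \frac{\kappa+\mu}{2}\|x^0 - x_{k-1}\|^2.$$

Consequently,

$$\begin{aligned}
D_0(x^0) &\ge D'(x_{k-1}) - \frac{\kappa}{2}\|x^0 - y_{k-2}\|^2 + \frac{\kappa}{2}\|x^0 - y_{k-1}\|^2 + \frac{\kappa+\mu}{2}\|x^0 - x_{k-1}\|^2 \\
&= D_0(x_{k-1}) + \frac{\kappa}{2}\|x_{k-1} - y_{k-2}\|^2 - \frac{\kappa}{2}\|x_{k-1} - y_{k-1}\|^2 - \frac{\kappa}{2}\|x^0 - y_{k-2}\|^2 + \frac{\kappa}{2}\|x^0 - y_{k-1}\|^2 \\
&\qquad + \frac{\kappa+\mu}{2}\|x^0 - x_{k-1}\|^2 \\
&= D_0(x_{k-1}) - \kappa\langle x^0 - x_{k-1}, y_{k-1} - y_{k-2}\rangle + \frac{\kappa+\mu}{2}\|x^0 - x_{k-1}\|^2 \\
&\ge D_0(x_{k-1}) - \frac{\kappa^2}{2(\kappa+\mu)}\|y_{k-1} - y_{k-2}\|^2,
\end{aligned}$$

where the last inequality uses the simple relation $\frac{1}{2}\|a\|^2 + 2\langle a, b\rangle + 2\|b\|^2 \ge 0$. As a result,

$$\begin{aligned}
G_k^* - D_0(x^0) &\le G_k^* - D_0(x_{k-1}) + \frac{\kappa^2}{2(\kappa+\mu)}\|y_{k-1} - y_{k-2}\|^2 \\
&\le G_k(x_{k-1}) - D_0(x_{k-1}) + \frac{\kappa^2}{2(\kappa+\mu)}\|y_{k-1} - y_{k-2}\|^2 \\
&= G_{k-1}(x_{k-1}) - D'(x_{k-1}) + \frac{\kappa^2}{2(\kappa+\mu)}\|y_{k-1} - y_{k-2}\|^2 \\
&\le \varepsilon_{k-1} + \frac{\kappa^2}{2(\kappa+\mu)}\|y_{k-1} - y_{k-2}\|^2. \qquad \square
\end{aligned}$$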
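
The last inequality in the lower bound on $D_0(x^0)$ can be verified by completing the square. As a sketch, write $u = x^0 - x_{k-1}$ and $v = y_{k-1} - y_{k-2}$ (shorthand used only in this remark); then

```latex
\begin{align*}
\frac{\kappa+\mu}{2}\|u\|^2 - \kappa\langle u, v\rangle + \frac{\kappa^2}{2(\kappa+\mu)}\|v\|^2
  &= \frac{\kappa+\mu}{2}\Big(\|u\|^2 - \frac{2\kappa}{\kappa+\mu}\langle u, v\rangle
     + \frac{\kappa^2}{(\kappa+\mu)^2}\|v\|^2\Big) \\
  &= \frac{\kappa+\mu}{2}\Big\|u - \frac{\kappa}{\kappa+\mu}\,v\Big\|^2 \;\ge\; 0 ,
\end{align*}
```

so that $-\kappa\langle u, v\rangle + \frac{\kappa+\mu}{2}\|u\|^2 \ge -\frac{\kappa^2}{2(\kappa+\mu)}\|v\|^2$.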
We remark that this bound is half of the analogous bound on $G_k(x_{k-1}) - G_k^*$ used for Propositions 3.2
and 3.4. Hence, a similar argument gives the bound on the number of inner iterations. We may finally
compute the iteration-complexity of accelerated MISO-Prox.

Proposition D.6 (Iteration-Complexity of Accelerated MISO-Prox). 

When $F$ is $\mu$-strongly convex, the accelerated MISO-Prox algorithm achieves the accuracy $\varepsilon$ with
an expected number of iterations upper bounded by



Proof. When $n > 2(L-\mu)/\mu$, there is no acceleration. The optimal value for $\kappa$ is zero, and we
may use Theorem 4.1 and Lemma C.1 to obtain the complexity




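When $n < 2(L-\mu)/\mu$, there is an acceleration, with $\kappa = 2(L-\mu)/n - \mu$. Let us compute the
global complexity using the "template" presented in Appendix C. The number of outer iterations is
given by

$$k_{\text{out}} = O\left(\sqrt{\frac{L-\mu}{n\mu}}\,\log\frac{F(x_0) - F^*}{\varepsilon}\right).$$

At each inner iteration, we initialize with the value $x^0$ described above, and we use Lemma D.5:

$$G_k^* - D_0(x^0) \le \varepsilon_{k-1} + \frac{\kappa}{2}\|y_{k-1} - y_{k-2}\|^2.$$

Then,

$$\frac{G_k^* - D_0(x^0)}{\varepsilon_k} \le \frac{R}{2},
\qquad \text{where} \qquad
R = \frac{2}{1-\rho} + \frac{2592\,\kappa}{\mu(1-\rho)^2(\sqrt{q}-\rho)^2} = O\left(\left(\frac{L-\mu}{n\mu}\right)^2\right).$$

With MISO-Prox, the expected number of inner iterations is thus given by Lemma C.1:

$$k_{\text{in}} = O\left(n\log(n^2 R)\right) = O\left(n\log\frac{L}{\mu}\right).$$

As a result,

$$\text{Comp} = O\left(\sqrt{\frac{nL}{\mu}}\,\log\frac{F(x_0) - F^*}{\varepsilon}\,\log\frac{L}{\mu}\right).$$

To conclude, the complexity of the accelerated algorithm is given by

$$O\left(\min\left\{\frac{L}{\mu}, \sqrt{\frac{nL}{\mu}}\right\}\log\frac{1}{\varepsilon}\,\log\frac{L}{\mu}\right). \qquad \square$$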


E Implementation Details of Experiments 

In the experimental section, we compare the performance with and without acceleration for three
algorithms, SAG, SAGA and MISO-Prox, on the $\ell_2$-logistic regression problem. In this part, we clarify
some details about the implementation of the experiments.

Firstly, we normalize the observed data before running the regression. Then we apply Catalyst using
parameters set according to the theory. Standard analysis of the logistic function shows that
the Lipschitz gradient parameter $L$ is 1/4 and the strong convexity parameter is $\mu = 0$ when there is no
regularization. Adding an $\ell_2$ term yields the strongly convex regimes. Several parameters
need to be fixed at the beginning. The parameter $\kappa$ is set to its optimal value suggested by
theory, which only depends on $n$, $\mu$, and $L$. More precisely, $\kappa = a(L-\mu)/(n+b) - \mu$,
with $(a,b) = (2,-2)$ for SAG, $(a,b) = (1/2,1/2)$ for SAGA and $(a,b) = (1,1)$ for MISO-Prox.
The parameter $\alpha_0$ is initialized as the positive solution of $x^2 + (1-q)x - 1 = 0$ where
$q = \mu/(\mu+\kappa)$. Furthermore, since the objective function is always positive, $F(x_0) - F^*$ can
be upper bounded by $F(x_0)$, which allows us to set $\varepsilon_k = (2/9)F(x_0)(1-\rho)^k$ in the strongly
convex case and $\varepsilon_k = 2F(x_0)/(9(k+2)^{4+\eta})$ in the non-strongly convex case. Finally, we set the
free parameter in the expression of $\varepsilon_k$ as follows. We simply set $\rho = 0.9\sqrt{q}$ in the strongly convex
case and $\eta = 0.1$ in the non-strongly convex case.
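
The parameter choices of this paragraph can be collected in a small helper. The sketch below is not the authors' code: the function names are hypothetical, and it assumes $q = \mu/(\mu+\kappa)$, the quadratic $x^2 + (1-q)x - 1 = 0$ for $\alpha_0$, and an exponent $4 + \eta$ in the non-strongly convex schedule of $\varepsilon_k$. It does not guard against degenerate inputs such as $n + b \le 0$.

```python
import math

# (a, b) pairs defining kappa = a * (L - mu) / (n + b) - mu for each method.
KAPPA_COEFFS = {"SAG": (2.0, -2.0), "SAGA": (0.5, 0.5), "MISO-Prox": (1.0, 1.0)}


def catalyst_setup(method, n, mu, L):
    """Return (kappa, q, alpha0) for the given method and problem constants."""
    a, b = KAPPA_COEFFS[method]
    kappa = a * (L - mu) / (n + b) - mu
    q = mu / (mu + kappa)
    # Positive root of x^2 + (1 - q) * x - 1 = 0.
    alpha0 = (-(1.0 - q) + math.sqrt((1.0 - q) ** 2 + 4.0)) / 2.0
    return kappa, q, alpha0


def eps_schedule(F_x0, q, k, eta=0.1):
    """Accuracy eps_k, using F(x0) as an upper bound on F(x0) - F*."""
    if q > 0:  # strongly convex case, with rho = 0.9 * sqrt(q)
        rho = 0.9 * math.sqrt(q)
        return (2.0 / 9.0) * F_x0 * (1.0 - rho) ** k
    # Non-strongly convex case (assumed exponent 4 + eta).
    return 2.0 * F_x0 / (9.0 * (k + 2) ** (4.0 + eta))
```

For instance, with $\mu = 0$ the helper returns $q = 0$ and $\alpha_0 = (\sqrt{5}-1)/2$, the positive root of $x^2 + x - 1 = 0$.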

To solve the subproblem at each iteration, the step-size parameters for SAG, SAGA and MISO are set
to the values suggested by theory, which only depend on $\mu$, $L$ and $\kappa$. All of the methods we compare
store $n$ gradients evaluated at previous iterates of the algorithm. For MISO, the convergence analysis
in Appendix D leads to the initialization $x_{k-1} + \frac{\kappa}{\mu+\kappa}(y_{k-1} - y_{k-2})$, which moves $x_{k-1}$ closer to $y_{k-1}$
and further away from $y_{k-2}$. We found that using this initial point for SAGA gave slightly
better results than $x_{k-1}$.